Data formats with Pandas and Numpy

Questions

  • How do you store your data right now?

  • Are you doing data cleaning / preprocessing every time you load the data?

Objectives

  • Learn the distinguishing characteristics of different data formats.

  • Learn how to use Pandas to read and write data in a variety of formats.

What is a data format?

Whenever you have data (e.g. measurement data, simulation results, analysis results), you’ll need a way to store it. This applies both when

  1. you’re storing the data in memory while you’re working on it;

  2. you’re storing it to a disk for later work.

Let’s consider this randomly generated dataset with various columns:

import pandas as pd
import numpy as np

n_rows = 100000

dataset = pd.DataFrame(
    data={
        'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
        'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
        'integer': np.random.choice(range(0,10), size=n_rows),
        'float': np.random.uniform(size=n_rows),
    },
)

dataset.info()

This DataFrame already has a data format: it is in the tidy data format! In the tidy data format each column holds one variable and each row one observation, and the columns are collected in a Pandas DataFrame.

Figure: the tidy data format.

Let’s consider another example:

n = 1000

data_array = np.random.uniform(size=(n,n))
data_array

Here we have a different data format: a two-dimensional array of numbers! This differs from a Pandas DataFrame in that the data is stored as one contiguous block instead of individual columns. This also means that the whole array must have a single data type.

https://github.com/elegant-scipy/elegant-scipy/raw/master/figures/NumPy_ndarrays_v2.png

Source: Elegant Scipy

Now the question is: can we store these datasets in a file in a way that keeps our data format intact?

For this we need a file format that supports our chosen data format.

Pandas has support for many file formats for tidy data and Numpy has support for some file formats for array data. However, there are many other file formats that can be used through other libraries.
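
For example, continuing from the dataset and data_array objects created above, a minimal sketch of writing and reading them (the file names are placeholders):

dataset.to_csv('dataset.csv', index=False)    # tidy data -> CSV via Pandas
dataset_csv = pd.read_csv('dataset.csv')

np.save('data_array.npy', data_array)         # array data -> npy via Numpy
data_array_npy = np.load('data_array.npy')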

What to look for in a file format?

When deciding which file format you should use for your program, you should remember the following:

There is no file format that is good for every use case.

Instead, there are various standard file formats for various use cases:

https://imgs.xkcd.com/comics/standards.png

Source: xkcd #927.

Usually, you’ll want to consider the following things when choosing a file format:

  1. Is everybody else / leading authorities in my field using a certain format? Maybe they have good reasons for using it.

  2. Is the file format good for my data format (is it fast/space efficient/easy to use)?

  3. Do I need a human-readable format or is it enough to work on it using programming languages?

  4. Do I want to archive / share the data or do I just want to store it while I’m working?

Exercise 1

Exercise

  • Create the example dataframe dataset with:

    import pandas as pd
    import numpy as np
    
    n_rows = 100000
    
    dataset = pd.DataFrame(
        data={
            'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
            'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
            'integer': np.random.choice(range(0,10), size=n_rows),
            'float': np.random.uniform(size=n_rows),
        },
    )
    
  • Use the %timeit-magic to calculate how long it takes to save / load the dataset as a CSV-file.
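
One possible approach in a Jupyter/IPython session (a sketch, not the official solution; the file name is a placeholder):

%timeit dataset.to_csv('dataset.csv')
%timeit dataset_csv = pd.read_csv('dataset.csv')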

Exercise 2

Exercise

  • Save the dataset dataset using a binary format of your choice.

  • Use the %timeit-magic to calculate how long it takes to save / load the dataset.

  • Did you notice any difference in speed?
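
One possible sketch using the Feather format (assumes the pyarrow package is installed; the file name is a placeholder):

%timeit dataset.to_feather('dataset.feather')
%timeit dataset_feather = pd.read_feather('dataset.feather')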

Exercise 3

Exercise

  • Create a numpy array. Store it as an npy-file.

  • Read the array back in and compare it to the original one. Does the data match?
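
One possible sketch (the array shape and file name are arbitrary choices):

import numpy as np

my_array = np.random.uniform(size=(100, 100))

# Save the array in Numpy's binary npy format and load it back.
np.save('my_array.npy', my_array)
my_array_npy = np.load('my_array.npy')

# Check whether the loaded array matches the original exactly.
print(np.all(my_array == my_array_npy))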

Benefits of binary file formats

Binary files come with various benefits compared to text files.

  1. They can represent floating point numbers with full precision.

  2. Storing data in binary format can potentially save lots of space. This is because you do not need to write numbers as characters. Additionally some file formats support compression of the data.

  3. Data loading from binary files is usually much faster than loading from text files. This is because memory can be allocated for the data before it is loaded, since the data type of each column is known.

  4. You can often store multiple datasets and metadata to the same file.

  5. Many binary formats allow for partial loading of the data. This makes it possible to work with datasets that are larger than your computer’s memory.
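
As an illustration of points 2 and 5, the Parquet format supports both compression and reading only a subset of columns; a minimal sketch (assumes the pyarrow package is installed; column names are from the example dataset):

# Write with explicit compression to reduce the file size.
dataset.to_parquet('dataset.parquet', compression='gzip')

# Read back only the columns that are needed instead of the full file.
dataset_partial = pd.read_parquet('dataset.parquet', columns=['integer', 'float'])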

Performance when writing the tidy dataset:

For the tidy dataset we had, we can test the performance of the different file formats:

File format | File size [MB] | Write time [ms] | Read time [ms]
----------- | -------------- | --------------- | --------------
CSV         | 4.571760       | 0.296015        | 0.072096
Feather     | 2.202471       | 0.013013        | 0.007742
Parquet     | 1.820971       | 0.009052        | 0.009052
HDF5        | 4.892181       | 0.037609        | 0.033721
NetCDF4     | 6.894043       | 0.073829        | 0.010776

The relatively poor performance of HDF5-based formats in this case is due to the data being mostly one-dimensional columns full of character strings.

Performance when writing the data array:

For the array-shaped data_array we had, we can test the performance of the different file formats:

File format | File size [MB] | Write time [ms] | Read time [ms]
----------- | -------------- | --------------- | --------------
CSV         | 23.841858      | 0.647893        | 0.639863
npy         | 7.629517       | 0.009885        | 0.002539
HDF5        | 7.631348       | 0.012877        | 0.002737
NetCDF4     | 7.637207       | 0.018905        | 0.009876

For this kind of data, HDF5-based formats perform much better.

Things to remember

  1. There is no file format that is good for every use case.

  2. Usually, your research question determines which libraries you want to use to solve it. Similarly, the data format you have determines the file format you want to use.

  3. However, if you’re using an existing framework or set of tools, or you work in a specific field, you should prioritize the formats that are used in that framework, toolset, or field.

  4. When you’re starting your project, it’s a good idea to take your initial data, clean it, and store the results in a good binary format that works as a starting point for your future analysis. If you’ve written the cleaning procedure as a script, you can always reproduce it.

  5. Throughout your work, you should use code to turn important data into a human-readable format (e.g. plots, averages, DataFrame.head()), rather than keeping your full data in a human-readable format.

  6. Once you’ve finished, you should store the data in a format that can be easily shared with other people.

Other file formats

Pickle

Key features

  • Type: Binary format

  • Packages needed: None (pickle-module is included with Python).

  • Space efficiency: Ok.

  • Good for sharing/archival: No! See warning below.

  • Tidy data:
    • Speed: Ok

    • Ease of use: Ok

  • Array data:
    • Speed: Ok

    • Ease of use: Ok

  • Best use cases: Saving Python objects for debugging.

Warning

Loading pickles that have been provided from untrusted sources is risky as they can contain arbitrary executable code.

Pickle is Python’s own serialization library. It allows you to store Python objects into a binary file, but it is not a format you will want to use for long term storage or data sharing. It is best suited for debugging your code by saving the Python variables for later inspection:

import pickle

with open('data_array.pickle', 'wb') as f:
    pickle.dump(data_array, f)

with open('data_array.pickle', 'rb') as f:
    data_array_pickle = pickle.load(f)

JSON (JavaScript Object Notation)

Key features

  • Type: Text format

  • Packages needed: None (json-module is included with Python).

  • Space efficiency: Ok.

  • Good for sharing/archival: No! Not well suited for large datasets.

  • Tidy data:
    • Speed: Ok

    • Ease of use: Ok

  • Array data:
    • Speed: Ok

    • Ease of use: Ok

  • Best use cases: Sharing small amounts of structured or hierarchical data (e.g. with web applications).

JSON is another popular human-readable data format. It is especially common when dealing with web applications (REST-APIs etc.). However, when you’re working with big data, you rarely want to keep your data in this format.

Similarly to other popular formats, Pandas can write and read JSON files with the to_json- and read_json-functions:

dataset.to_json('dataset.json')
dataset_json = pd.read_json('dataset.json')

However, JSON is often used to represent hierarchical data with multiple layers or multiple connections. For such data you might need to do a lot more processing.
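
As a sketch of what such processing can look like, Pandas' json_normalize can flatten nested records into a tidy table (the records below are made-up example data):

import pandas as pd

records = [
    {'name': 'apple', 'nutrition': {'kcal': 52, 'fiber': 2.4}},
    {'name': 'banana', 'nutrition': {'kcal': 89, 'fiber': 2.6}},
]

# Nested dictionaries become dot-separated columns, e.g. 'nutrition.kcal'.
flat = pd.json_normalize(records)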

Excel (binary)

Key features

  • Type: Binary format

  • Packages needed: openpyxl

  • Space efficiency: Bad.

  • Good for sharing/archival: Maybe.

  • Tidy data:
    • Speed: Bad

    • Ease of use: Good

  • Array data:
    • Speed: Bad

    • Ease of use: Ok

  • Best use cases: Sharing data in many fields. Quick data analysis.

Excel is very popular in social sciences and economics. However, it is not a good format for data science.

See Pandas’ documentation on working with Excel files.

Using Excel files with Pandas requires the openpyxl-package to be installed.
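
A minimal sketch of writing and reading the example dataset as an Excel file (assumes openpyxl is installed; the file name is a placeholder):

dataset.to_excel('dataset.xlsx', index=False)
dataset_xlsx = pd.read_excel('dataset.xlsx')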

Keypoints

  • Pandas can read and write a variety of data formats.

  • There are many good, standard formats, and you don’t need to create your own.

  • There are plenty of other libraries dedicated to various formats.