In-depth analysis of selected file formats

Here is a selection of file formats that are commonly used in data science, ordered roughly by their intended use.

Storing arbitrary Python objects

Pickle

Key features

  • Type: Binary format

  • Packages needed: None (pickle-module is included with Python).

  • Space efficiency: 🟨

  • Arbitrary data: ✅

  • Tidy data: 🟨

  • Array data: 🟨

  • Long term archival/sharing: ❌! See warning below.

  • Best use cases: Saving Python objects for debugging.

Warning

Loading pickles that you have not created is risky as they can contain arbitrary executable code.

Do not unpickle objects from sources that you do not trust!

Pickle is Python’s own serialization library. It allows you to store Python objects in a binary file, but it is not a format you will want to use for long-term storage or data sharing. It is best suited for debugging your code by saving Python variables for later inspection:

import pickle

# Write the object into a binary file
with open('data_array.pickle', 'wb') as f:
    pickle.dump(data_array, f)

# Read the object back from the file
with open('data_array.pickle', 'rb') as f:
    data_array_pickle = pickle.load(f)

Exercise 1

Exercise

  • Create an arbitrary Python object (for example, a string or a list). Pickle it.

    Read the pickled object back in and check if it matches the original one.

Storing tidy data

CSV (comma-separated values)

Key features

  • Type: Text format

  • Packages needed: numpy, pandas

  • Space efficiency: ❌

  • Arbitrary data: ❌

  • Tidy data: ✅

  • Array data: 🟨

  • Long term archival/sharing: ✅

  • Best use cases: Sharing data. Small data. Data that needs to be human-readable.

CSV is by far the most popular file format, as it is human-readable and easily shareable. However, it is not the best format to use when you’re working with big data.

Pandas has a very nice interface for writing and reading CSV files with to_csv- and read_csv-functions:

dataset.to_csv('dataset.csv', index=False)

dataset_csv = pd.read_csv('dataset.csv')
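
One caveat worth knowing: CSV stores everything as text, so data types have to be inferred when reading the file back, and some (such as timestamps) are not recovered automatically. A minimal sketch, assuming the dataset has a 'timestamp' column like the example data frame in Exercise 2 below:

# Timestamps come back as plain strings unless parsed explicitly
dataset_csv = pd.read_csv('dataset.csv', parse_dates=['timestamp'])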

Numpy has routines for saving and loading arrays as CSV files as well. Note that np.savetxt writes space-separated values by default, so pass delimiter=',' if you want actual commas:

np.savetxt('data_array.csv', data_array, delimiter=',')

data_array_csv = np.loadtxt('data_array.csv', delimiter=',')

Feather

Key features

  • Type: Binary format

  • Packages needed: pandas, pyarrow

  • Space efficiency: ✅

  • Arbitrary data: ❌

  • Tidy data: ✅

  • Array data: ❌

  • Long term archival/sharing: ❌

  • Best use cases: Temporary storage of tidy data.

Feather is a file format for storing data frames quickly. There are libraries for Python, R and Julia.

We can work with Feather files with to_feather- and read_feather-functions:

dataset.to_feather('dataset.feather')
dataset_feather = pd.read_feather('dataset.feather')
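
Because Feather is a binary format that stores data types explicitly, the round trip above should reproduce the data frame exactly. If you want to verify this, pandas ships a testing helper; a quick sketch, assuming dataset is the example data frame from Exercise 2:

# Raises an AssertionError if the two data frames differ
import pandas as pd
pd.testing.assert_frame_equal(dataset, dataset_feather)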

Feather is not a good format for storing array data, so we won’t present an example of that here.

Parquet

Key features

  • Type: Binary format

  • Packages needed: pandas, pyarrow

  • Space efficiency: ✅

  • Arbitrary data: 🟨

  • Tidy data: ✅

  • Array data: 🟨

  • Long term archival/sharing: ✅

  • Best use cases: Working with big datasets in tidy data format. Archival of said data.

Parquet is a standardized open-source columnar storage format that is commonly used for storing big data. Parquet is usable from many different languages (C, Java, Python, MATLAB, Julia, etc.).

We can work with Parquet files with to_parquet- and read_parquet-functions:

dataset.to_parquet('dataset.parquet')
dataset_parquet = pd.read_parquet('dataset.parquet')
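
Because Parquet is columnar, you can read a subset of the columns without touching the rest of the file, which is a big win for wide datasets. A small sketch (the column name 'integer' assumes the example dataset from Exercise 2):

# Read only the 'integer' column from the file
dataset_integer = pd.read_parquet('dataset.parquet', columns=['integer'])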

Parquet can be used to store arbitrary data and arrays as well, but doing so is more complicated, so we won’t cover it here.

Exercise 2

Exercise

  • Create the example dataset:

    import pandas as pd
    import numpy as np
    
    n_rows = 100000
    
    dataset = pd.DataFrame(
        data={
            'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
            'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
            'integer': np.random.choice(range(0,10), size=n_rows),
            'float': np.random.uniform(size=n_rows),
        },
    )
    
  • Save the dataset dataset as CSV. Load the dataset into a variable dataset_csv.

  • Use dataset.compare(dataset_csv) to check whether the loaded dataset matches the original one.

Storing array data

npy (numpy array format)

Key features

  • Type: Binary format

  • Packages needed: numpy

  • Space efficiency: 🟨

  • Arbitrary data: ❌

  • Tidy data: ❌

  • Array data: ✅

  • Long term archival/sharing: ❌

  • Best use cases: Saving numpy arrays temporarily.

If you want to temporarily store numpy arrays, you can use the numpy.save()- and numpy.load()-functions:

np.save('data_array.npy', data_array)
data_array_npy = np.load('data_array.npy')

There is also the numpy.savez()-function for storing multiple arrays in a single file:

np.savez('data_arrays.npz', data_array0=data_array, data_array1=data_array)
data_arrays = np.load('data_arrays.npz')
data_arrays['data_array0']

For big arrays it’s a good idea to check other binary formats such as HDF5 or NetCDF4.

The np.save- and np.savez-functions technically accept sparse matrices, but they store them as pickled Python objects, which defeats the purpose. It is better to use the dedicated scipy.sparse.save_npz- and scipy.sparse.load_npz-functions, which can give huge storage savings for sparse data.
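
A rough sketch of the dedicated functions (scipy.sparse.random here just generates a random sparse matrix for illustration):

import scipy.sparse

# Create a 1000x1000 matrix where ~1% of the entries are non-zero
sparse_matrix = scipy.sparse.random(1000, 1000, density=0.01, format='csr')

# Store and load using the dedicated sparse routines
scipy.sparse.save_npz('sparse_matrix.npz', sparse_matrix)
sparse_matrix_npz = scipy.sparse.load_npz('sparse_matrix.npz')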

HDF5 (Hierarchical Data Format version 5)

Key features

  • Type: Binary format

  • Packages needed: numpy, pandas, PyTables, h5py

  • Space efficiency: ✅

  • Arbitrary data: ❌

  • Tidy data: 🟨

  • Array data: ✅

  • Long term archival/sharing: ✅

  • Best use cases: Working with big datasets in array data format.

HDF5 is a high-performance storage format that can hold large amounts of data in multiple datasets within a single file. It is especially popular in fields, such as the physical sciences, that need to store big multidimensional arrays.

Pandas allows you to store tables as HDF5 using the PyTables-package under the hood. You can write and read these files with the to_hdf- and read_hdf-functions:

dataset.to_hdf('dataset.h5', key='dataset', mode='w')
dataset_hdf5 = pd.read_hdf('dataset.h5')
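
Since an HDF5 file can hold multiple datasets, the key-argument selects which one to write or read. A brief sketch (the key names here are made up for illustration):

# Write two tables into the same file
dataset.to_hdf('datasets.h5', key='run_1', mode='w')
dataset.to_hdf('datasets.h5', key='run_2', mode='a')

# Pick one of them when reading
dataset_run_1 = pd.read_hdf('datasets.h5', key='run_1')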

For writing data that is not a table, you can use the excellent h5py-package:

import h5py

# Writing:

# Open HDF5 file
h5_file = h5py.File('data_array.h5', 'w')
# Write dataset
h5_file.create_dataset('data_array', data=data_array)
# Close file and write data to disk. Important!
h5_file.close()

# Reading:

# Open HDF5 file again
h5_file = h5py.File('data_array.h5', 'r')
# Read the full dataset
data_array_h5 = h5_file['data_array'][()]
# Close file
h5_file.close()
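
A nice property of HDF5 is that you do not have to read the whole dataset into memory: slicing the dataset object reads only the requested part from disk. A small sketch, reusing the file written above (and assuming data_array is two-dimensional, like the 1000x1000 array created in Exercise 3 below):

# Open the file and read only the first ten rows of the dataset
h5_file = h5py.File('data_array.h5', 'r')
data_array_slice = h5_file['data_array'][0:10, :]
h5_file.close()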

NetCDF4 (Network Common Data Form version 4)

Key features

  • Type: Binary format

  • Packages needed: pandas, netCDF4/h5netcdf, xarray

  • Space efficiency: ✅

  • Arbitrary data: ❌

  • Tidy data: 🟨

  • Array data: ✅

  • Long term archival/sharing: ✅

  • Best use cases: Working with big datasets in array data format. Especially useful if the dataset contains spatial or temporal dimensions. Archiving or sharing those datasets.

NetCDF4 is a data format that uses HDF5 under the hood, but adds a standardized structure for datasets and their metadata. This makes the files readable by many different programs.

NetCDF4 is a common format for storing large data from big simulations in the physical sciences.

Using the interface provided by xarray:

import xarray as xr

# Write tidy data as NetCDF4
dataset.to_xarray().to_netcdf('dataset.nc', engine='h5netcdf')
# Read tidy data from NetCDF4
dataset_xarray = xr.open_dataset('dataset.nc', engine='h5netcdf')
dataset_netcdf4 = dataset_xarray.to_pandas()
dataset_xarray.close()

Working with array data is easy as well:

# Write array data as NetCDF4
xr.DataArray(data_array).to_netcdf('data_array.nc', engine='h5netcdf')
# Read array data from NetCDF4
data_array_xarray = xr.open_dataarray('data_array.nc', engine='h5netcdf')
data_array_netcdf4 = data_array_xarray.to_numpy()
data_array_xarray.close()

The advantage of NetCDF4 over plain HDF5 is that one can easily attach additional metadata, e.g. spatial dimensions (x, y, z) or timestamps (t), that tell where the grid points are situated. As the format is standardized, many programs can use this metadata for visualization and further analysis.
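
As a sketch of what that metadata looks like in practice (the coordinate values here are invented for illustration):

import numpy as np
import xarray as xr

# Invented coordinates for the rows and columns of data_array
x = np.linspace(0, 1, data_array.shape[0])
y = np.linspace(0, 1, data_array.shape[1])

# Attach the coordinates as named dimensions and store everything together
data_xr = xr.DataArray(data_array, dims=('x', 'y'), coords={'x': x, 'y': y})
data_xr.to_netcdf('data_array_coords.nc', engine='h5netcdf')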

Exercise 3

Exercise

  • Create an example numpy array:

    n = 1000
    
    data_array = np.random.uniform(size=(n,n))
    
  • Store the array as a .npy file.

  • Read the array back in and compare it to the original one. Does the data match?

Other file formats

JSON (JavaScript Object Notation)

Key features

  • Type: Text format

  • Packages needed: None (json-module is included with Python).

  • Space efficiency: ❌

  • Arbitrary data: 🟨

  • Tidy data: ❌

  • Array data: ❌

  • Long term archival/sharing: ✅

  • Best use cases: Saving nested/relational data, storing web requests.

JSON is a popular human-readable data format. It is especially common when dealing with web applications (REST-APIs etc.).

You rarely want to keep your data in this format unless you’re working with nested data that has multiple layers or many interconnections.

As with the other popular formats, Pandas can write and read JSON files with the to_json()- and read_json()-functions:

dataset.to_json('dataset.json')
dataset_json = pd.read_json('dataset.json')
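
For nested data that does not fit into a data frame, the standard-library json-module is often all you need. A small sketch (the dictionary contents are made up for illustration):

import json

nested_data = {
    'name': 'experiment_1',
    'parameters': {'n': 1000, 'seed': 42},
    'tags': ['test', 'draft'],
}

# Write the nested structure as JSON
with open('nested_data.json', 'w') as f:
    json.dump(nested_data, f)

# Read it back in
with open('nested_data.json', 'r') as f:
    nested_data_json = json.load(f)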

Excel

Key features

  • Type: Binary format (.xlsx is zipped XML)

  • Packages needed: openpyxl

  • Space efficiency: ❌

  • Arbitrary data: ❌

  • Tidy data: 🟨

  • Array data: ❌

  • Long term archival/sharing: 🟨

  • Best use cases: Sharing data in many fields. Quick data analysis.

Excel is very popular in social sciences and economics. However, it is not a good format for data science.

See Pandas’ documentation on working with Excel files.
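
For completeness, the pandas interface mirrors the other formats. A minimal sketch, assuming pandas is imported as pd, dataset is the example data frame, and the openpyxl-package is installed:

# Write and read an Excel workbook
dataset.to_excel('dataset.xlsx', index=False)
dataset_excel = pd.read_excel('dataset.xlsx')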

Graph formats (adjacency lists, gt, GraphML etc.)

Key features

  • Type: Many different formats

  • Packages needed: Depends on the format.

  • Space efficiency: 🟨

  • Arbitrary data: ❌

  • Tidy data: ❌

  • Array data: ❌

  • Long term archival/sharing: 🟨

  • Best use cases: Saving graphs or data that can be represented as a graph.

There are plenty of data formats for storing graphs. We won’t list them all here, as the optimal format depends heavily on the graph structure.

One can use functions in libraries such as networkx, graph-tool, and igraph to read and write graphs.
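
As a taste of what this looks like, here is a small networkx sketch using GraphML (the built-in karate club graph is just a convenient example dataset):

import networkx as nx

# A small built-in example graph
graph = nx.karate_club_graph()

# Write and read it in the GraphML format
nx.write_graphml(graph, 'graph.graphml')
graph_graphml = nx.read_graphml('graph.graphml')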