Library ecosystem

Questions

  • When you need some method beyond what we discuss in this course, what is available?

  • How do you decide what to build on for your work?

Objectives

  • Know of some other available packages, even if you don’t yet know how to use them.

  • Be able to evaluate what you should reuse and what you should develop yourself.

You can’t do everything yourself. In fact, we once heard a quote like this:

When you are a student, you are expected to do everything yourself, and that is how you are evaluated. When you become a researcher, you have to be able to reuse what others have done. We don’t have much practice in doing this. – A student

In this lesson, we’ll talk about the broader ecosystem in Python: all the resources you have available to you. Perhaps we can even classify this into two types:

  • Well-maintained libraries that are used by many others.

  • A wide variety of public code that might work but isn’t necessarily well-maintained (for example, code from articles).

We’ll start with the first then go to the second.

Glossary

Library

A collection of code used by a program.

Package

A library that has been made easily installable and reusable. Often published in public repositories such as the Python Package Index (PyPI).

Dependency

A requirement of another program, not included in that program.

The Python/SciPy ecosystem

This section is a quick tour of what exists in Python. You aren’t expected to remember all of these right now, but knowing what is out there is a starting point for a lot of future work.

The following could be considered the “core” packages. Many other packages build on these, and packages that try to do similar things often conform to their interfaces (especially numpy’s); a minimal sketch of how they fit together follows the list:

  • Python

  • Numpy - arrays, everything builds on this

  • Scipy - scientific functions (not as much builds directly on this)

  • matplotlib - plotting, many other plotting tools build on this

  • pandas - data structures

  • IPython / Jupyter: interactive work
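
As a minimal sketch of how these fit together (with made-up data, just for illustration):

  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt

  x = np.linspace(0, 10, 100)                  # numpy: the basic array type
  df = pd.DataFrame({"x": x, "y": np.sin(x)})  # pandas: labeled columns on top of numpy
  df.plot(x="x", y="y")                        # pandas delegates the drawing to matplotlib
  plt.show()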

Core numerics libraries

  • numpy - Arrays and array math.

  • scipy - Software for math, science, and engineering.
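
For example, scipy provides ready-made numerical routines that operate on numpy arrays; here is a small sketch minimizing a toy function:

  from scipy import optimize

  # Minimize (x - 2)**2; the minimum is at x = 2
  result = optimize.minimize_scalar(lambda x: (x - 2)**2)
  print(result.x)   # approximately 2.0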

Plotting

  • matplotlib - Base plotting package, somewhat low level but almost everything builds on it (a short example follows this list).

  • seaborn - Higher level plotting interface; statistical graphics.

  • Vega-Altair - Declarative Python plotting.

  • mayavi - 3D plotting

  • Plotly - Big graphing library.
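
Because so much builds on matplotlib, its basic interface is worth recognizing. A minimal example with made-up numbers:

  import matplotlib.pyplot as plt

  fig, ax = plt.subplots()
  ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x**2")
  ax.set_xlabel("x")
  ax.set_ylabel("y")
  ax.legend()
  fig.savefig("squares.png")   # or plt.show() for an interactive window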

Data analysis and other important core packages

  • pandas - Columnar data analysis (a basic example follows this list).

  • polars <https://pola.rs/> - Alternative to pandas with a similar API, re-imagined for more speed.

  • Vaex - Alternative to pandas with a similar API, designed for lazy loading and processing of huge DataFrames.

  • Dask - Alternative to pandas with a similar API that can run analyses in parallel.

  • xarray - Framework for working with labeled multi-dimensional arrays.

  • statsmodels - Statistical models and tests.

  • SymPy - Symbolic math.

  • networkx - Graph and network analysis.

  • graph-tool - Graph and network analysis toolkit implemented in C++.
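
Most of the alternatives above deliberately imitate the pandas way of working with columnar data, so the basic pattern transfers between them. A tiny example with made-up data:

  import pandas as pd

  df = pd.DataFrame({"sample": ["a", "a", "b", "b"],
                     "value":  [1.0, 2.0, 3.0, 4.0]})
  print(df.groupby("sample")["value"].mean())   # mean value per sample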

Interactive computing and human interface

  • Interactive computing

    • IPython - Nicer interactive interpreter

    • Jupyter - Web-based interface to IPython and other languages (includes projects such as jupyter notebook, lab, hub, …)

  • Testing

    • pytest - Automated testing interface (a minimal example follows this list)

  • Documentation

    • Sphinx - Documentation generator (also used for this lesson…)

  • Binder - Load any git repository in Jupyter automatically; good for reproducible research
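
As promised above, a flavor of pytest: it collects files and functions whose names start with test_ and runs every assertion in them. A minimal, self-contained example:

  # test_stats.py -- run with: pytest test_stats.py
  def mean(values):
      return sum(values) / len(values)

  def test_mean():
      assert mean([1, 2, 3]) == 2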

Data format support and data ingestion

  • pillow - Image manipulation. The original PIL is no longer maintained; the new “Pillow” is a drop-in replacement.

  • h5py and PyTables - Interfaces to the HDF5 file format.
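
For example, h5py makes an HDF5 file behave like a dictionary of numpy arrays (the file name and data below are made up):

  import numpy as np
  import h5py

  data = np.random.random((100, 100))
  with h5py.File("results.h5", "w") as f:
      f.create_dataset("data", data=data)   # write an array into the file

  with h5py.File("results.h5", "r") as f:
      loaded = f["data"][:]                 # read it back as a numpy array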

Speeding up code and parallelism

  • MPI for Python (mpi4py) - Message Passing Interface (MPI) in Python for parallelizing jobs.

  • cython - easily make C extensions for Python, also interface to C libraries

  • numba - Just-in-time compilation of functions for speed-up (see the sketch after this list)

  • PyPy - Alternative Python implementation with a just-in-time compiler; can be much faster for pure-Python code.

  • Dask - Distributed array data structures for parallel and distributed computation

  • Joblib - Easy embarrassingly parallel computing

  • IPyParallel - Easy parallel task engine.

  • numexpr - Fast evaluation of array expressions by automatically compiling the arithmetic.
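
As a sketch of the just-in-time approach mentioned in the numba item above, numba compiles a plain-Python loop to machine code on its first call; this toy function is only for illustration:

  import numpy as np
  from numba import njit

  @njit                     # compiled to machine code on the first call
  def total(a):
      s = 0.0
      for x in a:
          s += x
      return s

  print(total(np.arange(1_000_000.0)))   # later calls skip compilation and run fast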

Machine learning

  • nltk - Natural language processing toolkit.

  • scikit-learn - Traditional machine learning toolkit (see the example after this list).

  • xgboost - Toolkit for gradient boosting algorithms.
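
scikit-learn’s fit/predict pattern, which many other machine-learning libraries imitate, looks like this on one of its bundled toy datasets:

  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  model = RandomForestClassifier().fit(X_train, y_train)
  print(model.score(X_test, y_test))   # accuracy on the held-out data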

Deep learning

  • tensorflow - Deep learning library by Google.

  • pytorch - Currently the most popular deep learning library (a minimal autograd example follows this list).

  • keras - Simple high-level library for deep learning.

  • huggingface - Ecosystem for sharing and running deep learning models and datasets. Includes packages like transformers, datasets, accelerate, etc.

  • jax - Google’s library for NumPy-style array computation with automatic differentiation on GPUs and TPUs.

  • flax - Neural network framework built on JAX.

  • equinox - Another neural network framework built on JAX.

  • DeepSpeed - Algorithms for running massive-scale training. Included in many of the frameworks.

  • PyTorch Lightning - Framework for creating and training PyTorch models.

  • TensorBoard <https://www.tensorflow.org/tensorboard/> - Tool for visualizing model training on a web page.
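
The common thread in these frameworks is automatic differentiation. In PyTorch, for instance, gradients of a loss with respect to model parameters come nearly for free (random stand-in data below):

  import torch

  model = torch.nn.Linear(3, 1)   # a one-layer model
  x = torch.randn(8, 3)           # a fake batch: 8 samples, 3 features each
  y = torch.randn(8, 1)

  loss = torch.nn.functional.mse_loss(model(x), y)
  loss.backward()                 # autograd fills in .grad for every model parameter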

Other packages for special cases

  • dateutil and pytz - Date arithmetic and handling, timezone database and conversion.
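
For instance, dateutil can parse most date strings and do calendar-aware arithmetic:

  from dateutil import parser
  from dateutil.relativedelta import relativedelta

  t = parser.parse("2024-02-29 13:00")   # flexible parsing of date strings
  print(t + relativedelta(months=1))     # calendar-aware: 2024-03-29 13:00:00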

Connecting Python to other languages

As we discussed with Scipy, many of the above packages aren’t written in Python: they are written in some other language and have a Python interface. The reference implementation of Python, CPython, is written in C, and thus has excellent C interfaces. This makes two things possible:

  • Extending Python by writing your own modules in C.

    • It’s actually common to first have (or write) an analysis package in C or C++, then make the Python interface. Then it can be supported by other languages, too.

    • Or one starts an analysis package in Python, and slowly moves bits of it to C over time as there is need.

  • Embedding Python, where you have another primary application that uses Python under the hood as an internal scripting language.

These features aren’t exactly unique to Python, but Python does support them very well. Read more: Extending and embedding Python.

Tools for interfacing with other languages

These days, one rarely extends the Python interpreter directly; instead, one typically uses one of these tools (a ctypes sketch follows the list):

  • cffi and ctypes - interface to C and compatible libraries

  • cython - easily make C extensions for Python, also interface to C libraries

  • f2py - interface to Fortran code

  • swig - connect to a variety of programming languages.

  • Boost.Python - Another Python/C++ interface

  • JuliaCall / PyJulia - Call Julia code from Python.
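
The lowest-friction of these is ctypes, since it ships with Python itself. A small sketch calling cos() from the C math library; note that the library file name is platform-specific (“libm.so.6” assumes Linux):

  import ctypes

  libm = ctypes.CDLL("libm.so.6")        # on macOS: "libm.dylib"
  libm.cos.restype = ctypes.c_double     # declare the C signature so that...
  libm.cos.argtypes = [ctypes.c_double]  # ...arguments are converted correctly
  print(libm.cos(0.0))                   # 1.0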

Evaluating Python packages for reuse

Above, we talked about well-maintained mainstream packages. But do you trust random code you find online (for example, code included in a paper)?

Scientific results especially have to be correct. Still, you can’t build everything yourself, so you have to evaluate each situation carefully.

Below are some things to consider:

  • Are there releases? Have they been going on for a while?

  • Are releases installable without copy-paste?

  • Are dependencies handled well?

  • Does the code change without warning, so that it no longer works with your code? Is this relevant to you?

  • Is there good documentation that tells you not just how to use it, but also how it works?

  • Is there automated testing? What’s your evaluation of the risk of undetectable scientific errors?

  • Is there a community, or is it one person? Is it backed by some organization? Does it have a permanent home?

  • Is it on a public hosting site (GitLab, GitHub, Bitbucket, etc.) where a community could form?

  • Do others post issues and make contributions? Are these issues dealt with in a timely manner? Can you search past bug reports?

  • Is the software citeable?

Is your work reusable?

Every small project you do contributes a little bit to the Python and SciPy ecosystem. This course has sort of started you on that path, and a CodeRefinery workshop will make sure you have the tools to produce high-quality, reusable code.

What’s next?

  • The CodeRefinery workshop mentioned above will prepare you for others to reuse your code and for you to contribute to other code.

  • The upcoming Dependency management lesson will teach you how to record and manage dependencies so that anyone can seamlessly reuse your code.

Exercises

Libraries 1.1: Libraries in your work

What libraries do you use in your work? What have you made that you could instead have reused from some other source? And what have you used from some other source that you wish you had written yourself?

Discuss in your groups or on HackMD.

Libraries 1.2: Evaluating packages

Below are links to some packages, both public ones and some made by the authors of this lesson. Evaluate them, considering: “would I use this in my project?”

  1. https://github.com/networkx/networkx/

  2. some code on a webpage in a paper’s footnote

  3. https://github.com/rkdarst/pcd

  4. https://github.com/dftlibs/numgrid

  5. https://github.com/rkdarst/dynbench

  6. https://vpython.org/

Keypoints

  • Almost everything you need can already be found somewhere; what you add is your incremental work.

  • When do you build on that other work, and when do you create things yourself?