Library ecosystem
Questions
What happens when you need some method beyond what we discuss in this course, what is available?
How do you decide what to build on for your work?
Objectives
Know of some other available packages, but don’t necessarily know how to use them.
Be able to evaluate what you should reuse and what you should develop yourself.
You can’t do everything yourself. In fact, we once heard a sentiment like this:
When you are a student, you are expected to do everything yourself, and that is how you are evaluated. When you become a researcher, you have to be able to reuse what others have done. We don’t have much practice in doing this. – A student
In this lesson, we’ll talk about the broader ecosystem in Python: all the resources you have available to you. Perhaps we can even classify this into two types:
Well-maintained libraries that are used by many others.
A wide variety of public code that might work but isn’t necessarily well-maintained (for example, code from articles).
We’ll start with the first then go to the second.
Glossary
- Library
A collection of code used by a program.
- Package
A library that has been made easily installable and reusable. Often published on public repositories such as the Python Package Index
- Dependency
A requirement of another program, not included in that program.
The Python/SciPy ecosystem
This section is nothing more than a tour of what exists in Python. You aren’t expected to remember any of these right now, but knowing these names gives you a starting point for a lot of future work.
The following could be considered the “core” packages. Many other packages build on these, and packages that try to do similar things often try to conform to their interfaces (especially numpy’s):
Python
Numpy - arrays, everything builds on this
Scipy - scientific functions (not necessarily a lot builds on this)
matplotlib - plotting, many other plotting tools build on this
pandas - data structures
IPython / Jupyter: interactive work
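A minimal sketch of why numpy sits at the bottom of the stack: its arrays are the common currency that most of the packages below consume and produce.

```python
import numpy as np

# NumPy arrays replace explicit Python loops with vectorized operations;
# pandas, scipy, matplotlib, and many others all accept arrays like these.
x = np.linspace(0.0, 1.0, 5)   # five evenly spaced sample points
y = x ** 2                     # element-wise arithmetic, no loop needed

print(y.sum())                 # 0.0 + 0.0625 + 0.25 + 0.5625 + 1.0 = 1.875
```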
Core numerics libraries
Plotting
matplotlib - Base plotting package, somewhat low level but almost everything builds on it.
seaborn - Higher level plotting interface; statistical graphics.
Vega-Altair - Declarative Python plotting.
mayavi - 3D plotting
Plotly - Big graphing library.
Data analysis and other important core packages
pandas - Columnar data analysis.
polars - Alternative to pandas with a similar API, re-imagined for more speed.
Vaex - Alternative to pandas with a similar API, for lazy-loading and processing huge DataFrames.
Dask - Alternative to Pandas that uses similar API and can do analysis in parallel.
xarray - Framework for working with labelled multi-dimensional arrays.
statsmodels - Statistical models and tests.
SymPy - Symbolic math.
networkx - Graph and network analysis.
graph-tool - Graph and network analysis toolkit implemented in C++.
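Several of the packages above (pandas, polars, Vaex, Dask) offer a variant of the same DataFrame interface. A small pandas example of the split-apply-combine pattern they all support:

```python
import pandas as pd

# A tiny columnar dataset: two groups, four measurements
df = pd.DataFrame({"species": ["a", "b", "a", "b"],
                   "mass": [1.0, 2.0, 3.0, 4.0]})

# Group rows by species and average each group's mass
means = df.groupby("species")["mass"].mean()
print(means["a"], means["b"])   # 2.0 3.0
```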
Interactive computing and human interface
Interactive computing
Testing
pytest - Automated testing interface
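pytest discovers files and functions whose names start with `test_` and runs every `assert` inside them. A minimal sketch (the file and function names here are illustrative):

```python
# contents of test_stats.py -- pytest collects test_* files automatically
def mean(values):
    """Arithmetic mean of a sequence."""
    return sum(values) / len(values)

def test_mean_integers():
    assert mean([1, 2, 3]) == 2

def test_mean_floats():
    assert abs(mean([0.1, 0.2]) - 0.15) < 1e-12
```

Running `pytest` in the same directory finds and runs both tests with no extra configuration.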
Documentation
Sphinx - Documentation generator (also used for this lesson…)
Development environments
Spyder - Interactive Python development environment.
Visual Studio Code - Microsoft’s flagship code editor.
PyCharm - JetBrains’s Python IDE.
Binder - load any git repository in Jupyter automatically, good for reproducible research
Data format support and data ingestion
Speeding up code and parallelism
MPI for Python (mpi4py) - Message Passing Interface (MPI) in Python for parallelizing jobs.
cython - easily make C extensions for Python, also interface to C libraries
numba - just in time compiling of functions for speed-up
PyPy - Alternative Python implementation with a just-in-time compiler, often much faster for pure-Python code.
Dask - Distributed array data structure for distributed computation
Joblib - Easy embarrassingly parallel computing
IPyParallel - Easy parallel task engine.
numexpr - Fast evaluation of array expressions by automatically compiling the arithmetic.
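Joblib and IPyParallel wrap the "embarrassingly parallel" pattern with nicer ergonomics; the same idea can be sketched with only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(n):
    # Stand-in for one expensive, independent task
    # (a file to process, a parameter value to try, ...)
    return n * n

# Map the task over inputs using a pool of workers; results come
# back in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(5)))

print(results)  # [0, 1, 4, 9, 16]
```

For CPU-bound pure-Python work you would use `ProcessPoolExecutor` (or Joblib) instead of threads, since threads share one interpreter lock.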
Machine learning
nltk - Natural language processing toolkit.
scikit-learn - Traditional machine learning toolkit.
xgboost - Toolkit for gradient boosting algorithms.
Deep learning
tensorflow - Deep learning library by Google.
pytorch - Currently the most popular deep learning library.
keras - High-level library for doing deep learning.
huggingface - Ecosystem for sharing and running deep learning models and datasets. Includes packages like transformers, datasets, accelerate, etc.
jax - Google’s Python library for running NumPy and automatic differentiation on GPUs.
flax - Neural network framework built on Jax.
equinox - Another neural network framework built on Jax.
DeepSpeed - Algorithms for running massive scale trainings. Included in many of the frameworks.
PyTorch Lightning - Framework for creating and training PyTorch models.
TensorBoard - Tool for visualizing model training on a web page.
Other packages for special cases
Connecting Python to other languages
As we discussed with Scipy, many of the above packages aren’t actually written in Python: they are written in some other language and have a Python interface. The reference Python implementation (CPython) is written in C, and thus has great C interfaces. This enables two things:
Extending Python by writing your own modules in C.
It’s actually common to first have (or write) an analysis package in C or C++, then make the Python interface. Then it can be supported by other languages, too.
Or one starts an analysis package in Python, and slowly moves bits of it to C over time as there is need.
Embedding Python, where you have another primary application that uses Python under the hood as an internal scripting language.
These features aren’t exactly unique to Python, but Python does support them very well. Read more: Extending and embedding Python.
Tools for interfacing with other languages
These days, one rarely extends the Python interpreter directly; instead, one uses tools that generate or manage the interface for you, such as ctypes, cffi, Cython, pybind11, or f2py.
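The lowest-effort of these tools is the standard library’s ctypes, which loads a shared library at runtime and calls its functions directly. A small sketch, calling `cos()` from the C math library (the library lookup is platform-dependent; this works on typical Linux and macOS systems):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (libm); fall back to a common
# Linux name if the lookup helper finds nothing
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double cos(double)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

Cython, cffi, and pybind11 do the same job with less manual signature bookkeeping and better performance for heavy use.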
Evaluating Python packages for reuse
Above, we talked about well-maintained mainstream packages. Do you trust random code you find online (for example included in a paper)?
Especially consider scientific results, which have to be correct. Still, you also can’t build everything yourself, so you have to carefully evaluate the situation.
Below are some things to consider:
Are there releases? Have they been going on for a while?
Are releases installable without copy-paste?
Are dependencies handled well?
Does the code change without warning, so that it no longer works with your code? Is this relevant to you?
Is there good documentation that tells not just how to use it, but also how it works?
Is there automated testing? What’s your evaluation of the risk of undetectable scientific errors?
Is there a community, or is it one person? Is it backed by some organization? Does it have a permanent home?
Is it on a public hosting site (GitLab, GitHub, Bitbucket, etc.) where a community could form?
Do others post issues and make contributions? Are these issues dealt with in a timely manner? Can you search past bug reports?
Is the software citeable?
Is your work reusable?
Every small project you do contributes a little bit to the Python and SciPy ecosystem. This course has sort of started you on that path, and a CodeRefinery workshop will make sure you have the tools to produce high-quality, reusable code.
What’s next?
The CodeRefinery workshop mentioned above will prepare you for others to reuse your code and for you to contribute to other code.
The upcoming Dependency management lesson will teach you how to record and manage dependencies so that anyone can seamlessly reuse your code.
Exercises
Libraries 1.1: Libraries in your work
What libraries do you use in your work? What have you made that you could instead have reused from some other source? What have you used from some other source that you wish you had created yourself?
Discuss in your groups or HackMD.
Libraries 1.1
… is there anything to say here?
Libraries 1.2: Evaluating packages
Below are some links to some packages, both public and made by the authors of this lesson. Evaluate them, considering “would I use this in my project?”
some code on webpage in a paper’s footnote
Libraries 1.2
networkx: This seems to be a relatively large, active project using best practices. Probably usable.
I would probably use it if I had to, but would prefer not to.
This (written by one of the authors of this lesson) has no documentation, no community, no best practices, and is very old. Probably not a good idea to try to use it.
This project uses best practices, but doesn’t seem to have a big community. It’s probably fine to use, but who knows if it will be maintained 10 years from now. It does have automated tests via GitHub Actions (.github/workflows and the green checks), so the authors have put some work into making it correct.
This (also written by one of the authors) looks like it was made for a paper of some sort. It has some minimal documentation, but is still missing many best practices and is clearly not maintained anymore (look at the ancient pull request). Probably not a good idea to use unless you have to.
This project has a pretty website and some information, but it doesn’t seem to follow the best practices of an open repository, and it lives in custom locations that could disappear at any time.
You notice that several of the older projects here were written by one of the authors of this lesson. It goes to show that everyone starts somewhere and improves over time - don’t feel bad if your work isn’t perfect, as long as you keep trying to get better!
See also
Topical Software in the SciPy ecosystem - relatively detailed (but not comprehensive) list of projects
Keypoints
Almost everything you need can already be found, except your incremental work.
When do you build on that other work, and when do you create things yourself?