Library ecosystem
Questions
What happens when you need some method beyond what we discuss in this course, what is available?
How do you decide what to build on for your work?
Objectives
Know of some other available packages, but don’t necessarily know how to use them.
Be able to evaluate what you should reuse and what you should develop yourself.
You can’t do everything yourself. In fact, we once heard a sentiment like this:
When you are a student, you are expected to do everything yourself, and that is how you are evaluated. When you become a researcher, you have to be able to reuse what others have done. We don’t have much practice in doing this. – A student
In this lesson, we’ll talk about the broader ecosystem in Python: all the resources you have available to you. Perhaps we can even classify this into two types:
Well-maintained libraries that are used by many others.
A wide variety of public code that might work but isn’t necessarily well-maintained (for example, code from articles).
We’ll start with the first then go to the second.
Glossary
- Library
A collection of code used by a program.
- Package
A library that has been made easily installable and reusable. Often published on public repositories such as the Python Package Index
- Dependency
A requirement of another program, not included in that program.
The Python/SciPy ecosystem
This section is nothing more than a tour of what exists in Python. You aren’t expected to remember any of these right now, but knowing these names gives you a starting point for a lot of future work.
The following could be considered the “core” packages. Many other packages build on these, and packages that try to do similar things often try to conform to their interfaces (especially numpy’s):
Python
Numpy - arrays, everything builds on this
Scipy - scientific functions (not necessarily a lot builds on this)
matplotlib - plotting, many other plotting tools build on this
pandas - data structures
IPython / Jupyter: interactive work
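A minimal sketch of why numpy sits at the bottom of the stack: its arrays are the common currency that most of the packages below consume and produce.

```python
import numpy as np

# NumPy arrays replace explicit Python loops with vectorized operations;
# pandas, scipy, matplotlib, and many others all accept arrays like these.
x = np.linspace(0.0, 1.0, 5)   # five evenly spaced sample points
y = x ** 2                     # element-wise arithmetic, no loop needed

print(y.sum())                 # 0.0 + 0.0625 + 0.25 + 0.5625 + 1.0 = 1.875
```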
Core numerics libraries
Plotting
matplotlib - Base plotting package, somewhat low level but almost everything builds on it.
seaborn - Higher level plotting interface; statistical graphics.
Vega-Altair - Declarative Python plotting.
mayavi - 3D plotting
Plotly - Big graphing library.
Data analysis and other important core packages
pandas - Columnar data analysis.
polars - Alternative to pandas with a similar API, re-imagined for more speed.
Vaex - Alternative to pandas with a similar API, for lazy-loading and processing huge DataFrames.
Dask - Alternative to Pandas that uses similar API and can do analysis in parallel.
xarray - Framework for working with labelled multi-dimensional arrays.
statsmodels - Statistical models and tests.
SymPy - Symbolic math.
networkx - Graph and network analysis.
graph-tool - Graph and network analysis toolkit implemented in C++.
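Several of the packages above (pandas, polars, Vaex, Dask) offer a variant of the same DataFrame interface. A small pandas example of the split-apply-combine pattern they all support:

```python
import pandas as pd

# A tiny columnar dataset: two groups, four measurements
df = pd.DataFrame({"species": ["a", "b", "a", "b"],
                   "mass": [1.0, 2.0, 3.0, 4.0]})

# Group rows by species and average each group's mass
means = df.groupby("species")["mass"].mean()
print(means["a"], means["b"])   # 2.0 3.0
```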
Interactive computing and human interface
Interactive computing
Testing
pytest - Automated testing interface
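pytest discovers files and functions whose names start with `test_` and runs every `assert` inside them. A minimal sketch (the file and function names here are illustrative):

```python
# contents of test_stats.py -- pytest collects test_* files automatically
def mean(values):
    """Arithmetic mean of a sequence."""
    return sum(values) / len(values)

def test_mean_integers():
    assert mean([1, 2, 3]) == 2

def test_mean_floats():
    assert abs(mean([0.1, 0.2]) - 0.15) < 1e-12
```

Running `pytest` in the same directory finds and runs both tests with no extra configuration.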
Documentation
Sphinx - Documentation generator (also used for this lesson…)
Development environments
Spyder - Interactive Python development environment.
Visual Studio Code - Microsoft’s flagship code editor.
PyCharm - JetBrains’s Python IDE.
Binder - load any git repository in Jupyter automatically, good for reproducible research
Data format support and data ingestion
Speeding up code and parallelism
MPI for Python (mpi4py) - Message Passing Interface (MPI) in Python for parallelizing jobs.
cython - easily make C extensions for Python, also interface to C libraries
numba - just in time compiling of functions for speed-up
PyPy - Alternative Python implementation with a just-in-time compiler, often much faster for pure-Python code.
Dask - Distributed array data structure for distributed computation
Joblib - Easy embarrassingly parallel computing
IPyParallel - Easy parallel task engine.
numexpr - Fast evaluation of array expressions by automatically compiling the arithmetic.
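Joblib and IPyParallel wrap the "embarrassingly parallel" pattern with nicer ergonomics; the same idea can be sketched with only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(n):
    # Stand-in for one expensive, independent task
    # (a file to process, a parameter value to try, ...)
    return n * n

# Map the task over inputs using a pool of workers; results come
# back in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(5)))

print(results)  # [0, 1, 4, 9, 16]
```

For CPU-bound pure-Python work you would use `ProcessPoolExecutor` (or Joblib) instead of threads, since threads share one interpreter lock.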
Machine learning
nltk - Natural language processing toolkit.
scikit-learn - Traditional machine learning toolkit.
xgboost - Toolkit for gradient boosting algorithms.
Deep learning
tensorflow - Deep learning library by Google.
pytorch - Currently the most popular deep learning library.
keras - High-level library for doing deep learning.
huggingface - Ecosystem for sharing and running deep learning models and datasets. Includes packages like transformers, datasets, accelerate, etc.
jax - Google’s Python library for running NumPy and automatic differentiation on GPUs.
flax - Neural network framework built on Jax.
equinox - Another neural network framework built on Jax.
DeepSpeed - Algorithms for running massive scale trainings. Included in many of the frameworks.
PyTorch Lightning - Framework for creating and training PyTorch models.
TensorBoard - Tool for visualizing model training on a web page.
Other packages for special cases
Connecting Python to other languages
As we discussed with Scipy, many of the above packages aren’t actually written in Python: they are written in some other language and have a Python interface. The reference Python implementation (CPython) is written in C, and thus has great C interfaces. This enables two things:
Extending Python by writing your own modules in C.
It’s actually common to first have (or write) an analysis package in C or C++, then make the Python interface. Then it can be supported by other languages, too.
Or one starts an analysis package in Python, and slowly moves bits of it to C over time as there is need.
Embedding Python, where you have another primary application that uses Python under the hood as an internal scripting language.
These features aren’t exactly unique to Python, but Python does support them very well. Read more: Extending and embedding Python.
Tools for interfacing with other languages
These days, one rarely extends the Python interpreter directly; instead, one uses tools that generate or manage the interface for you, such as ctypes, cffi, Cython, pybind11, or f2py.
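The lowest-effort of these tools is the standard library’s ctypes, which loads a shared library at runtime and calls its functions directly. A small sketch, calling `cos()` from the C math library (the library lookup is platform-dependent; this works on typical Linux and macOS systems):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (libm); fall back to a common
# Linux name if the lookup helper finds nothing
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double cos(double)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

Cython, cffi, and pybind11 do the same job with less manual signature bookkeeping and better performance for heavy use.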
Evaluating Python packages for reuse
Above, we talked about well-maintained mainstream packages. Do you trust random code you find online (for example included in a paper)?
Especially consider scientific results, which have to be correct. Still, you also can’t build everything yourself, so you have to carefully evaluate the situation.
Below are some things to consider:
Are there releases? Have they been going on for a while?
Are releases installable without copy-paste?
Are dependencies handled well?
Does the code change without warning, so that it no longer works with your code? Is this relevant to you?
Is there good documentation that tells not just how to use it, but also how it works?
Is there automated testing? What’s your evaluation of the risk of undetectable scientific errors?
Is there a community, or is it one person? Is it backed by some organization? Does it have a permanent home?
Is it on a public hosting site (GitLab, GitHub, Bitbucket, etc.) where a community could form?
Do others post issues and make contributions? Are these issues dealt with in a timely manner? Can you search past bug reports?
Is the software citeable?
Is your work reusable?
Every small project you do contributes a little bit to the Python and SciPy ecosystem. This course has sort of started you on that path, and a CodeRefinery workshop will make sure you have the tools to produce high-quality, reusable code.
What’s next?
The CodeRefinery workshop mentioned above will prepare you for others to reuse your code and for you to contribute to other code.
The upcoming Dependency management lesson will teach you how to record and manage dependencies so that anyone can seamlessly reuse your code.
Exercises
Libraries 1.1: Libraries in your work
What libraries do you use in your work? What have you made that you could instead have reused from some other source? What have you used from some other source that you wish you had created yourself?
Discuss in your groups or HackMD.
Libraries 1.1
… is there anything to say here?
Libraries 1.2: Evaluating packages
Below are some links to some packages, both public and made by the authors of this lesson. Evaluate them, considering “would I use this in my project?”
some code on webpage in a paper’s footnote
Libraries 1.2
networkx: This seems to be a relatively large, active project using best practices. Probably usable.
I would probably use it if I had to, but would prefer not to.
This (written by one of the authors of this lesson) has no documentation, no community, no best practices, and is very old. Probably not a good idea to try to use it.
This project uses best practices, but doesn’t seem to have a big community. It’s probably fine to use, but who knows if it will be maintained 10 years from now. It does have automated tests via GitHub Actions (.github/workflows and the green checks), so the authors have put some work into making it correct.
This (also written by one of the authors) looks like it was made for a paper of some sort. It has some minimal documentation, but is still missing many best practices and is clearly not maintained anymore (look at the ancient pull request). Probably not a good idea to use unless you have to.
This project has a pretty website and some information, but it doesn’t seem to follow the best practices of an open repository, and it lives in custom locations that could disappear at any time.
You notice that several of the older projects here were written by one of the authors of this lesson. It goes to show that everyone starts somewhere and improves over time - don’t feel bad if your work isn’t perfect, as long as you keep trying to get better!
See also
Topical Software in the SciPy ecosystem - relatively detailed (but not comprehensive) list of projects
Keypoints
Almost everything you need can already be found, except your incremental work.
When do you build on that other work, and when do you create things yourself?