Dependency management
Questions
Do you expect your code to work in one year? Five? What if it uses numpy or tensorflow or random-github-package?
How can my collaborators get the same results as me? What about future me?
How can my collaborators easily install my code with all the necessary dependencies?
How can I make it easy for my colleagues to reproduce my results?
How can I work on two (or more) projects with different and conflicting dependencies?
Objectives
Learn how to record dependencies
Be able to communicate the dependencies as part of a report/thesis/publication
Learn how to use isolated environments for different projects
Simplify the use and reuse of scripts and projects
How do you track the dependencies of your project?
Dependency: Reliance on an external component. In this case, a separately installed software package such as numpy.
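A quick way to see what your project currently relies on is to list what is installed in the environment you run it in. A minimal sketch (assuming pip, and optionally conda, are available on your system):
$ pip list     # packages installed with pip in the current environment
$ conda list   # packages in the currently active conda environment (if you use conda)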
Exercises 1
Dependencies-1: Discuss dependency management (5 min)
Please discuss and answer via collaborative document the following questions:
How do you install Python packages (libraries) that you use in your work? From PyPI using pip? From other places using pip? Using conda?
How do you track/record the dependencies? Do you write them into a file or README? Into requirements.txt or environment.yml?
If you track dependencies in a file, why do you do this?
Have you ever experienced that a project needed a different version of a Python library than the one on your computer? If yes, how did you solve it?
PyPI (The Python Package Index) and (Ana)conda
PyPI (The Python Package Index) and Conda are popular packaging/dependency management tools.
When you run pip install you typically install from PyPI, but you can also pip install from a GitHub repository and similar.
When you run conda install you typically install from Anaconda Cloud, where there are conda channels maintained by Anaconda Inc. and by various communities.
Why are there two ecosystems?
PyPI
Installation tool: pip
Summary: PyPI is traditionally used for Python-only packages or for Python interfaces to external libraries. There are also packages that have bundled external libraries (such as numpy).
Number of packages: Huge number. Old versions are supported for a long time.
How libraries are handled: If your code depends on external libraries or tools, these things need to be either included in the pip-package or provided via some other installation system (like operating system installer or manual installation).
- Pros:
Easy to use
Package creation is easy
- Cons:
Installing packages that need external libraries can be complicated
Conda
Installation tool: conda or mamba
Summary: Conda aims to be a more general package distribution tool and it tries to provide not only the Python packages, but also libraries and tools needed by the Python packages. Most scientific software written in Python uses external libraries to speed up calculations and installing these libraries can often become complicated without conda.
Number of packages: Curated list of packages in the defaults channel; a huge number in community-managed channels. Other packages can be installed via pip.
How libraries are handled: Required libraries are installed as separate conda packages.
- Pros:
Quite easy to use
Easier to manage packages that need external libraries
- Cons:
Package creation is harder
Anaconda vs. miniconda vs. conda vs. mamba vs. Anaconda Cloud vs. conda-forge vs. miniforge
Package sources:
Anaconda Cloud - a package cloud maintained by Anaconda Inc. It is a free repository that houses conda package channels.
Conda-forge - the largest open source community channel.
Package managers:
conda - a package and environment management system used by Anaconda. It is an open source project maintained by Anaconda Inc.
mamba - a drop-in replacement for conda that performs installations faster.
Package manager deployments:
Anaconda - a distribution of conda packages made by Anaconda Inc. It is free for academic and non-commercial use.
Miniconda - a minimal installer that provides conda and uses the defaults channel by default.
Miniforge - a Miniconda replacement that uses conda-forge as the default channel and also includes mamba.
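As an illustration, mamba accepts the same commands as conda, so switching is mostly a matter of typing mamba instead of conda (a sketch, assuming mamba is installed, e.g. via Miniforge):
$ conda install scipy
$ mamba install scipy   # same syntax, typically faster dependency resolution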
In the packaging episode we will meet PyPI and Anaconda again and practice how to share Python packages.
Creating isolated environments
An isolated environment allows installing packages without affecting the rest of your operating system or any other projects. Isolated environments solve a couple of problems:
You can install specific versions of packages into them.
You can create one environment for each project, so you will not run into problems if two projects require different versions of the same package.
If you make a mistake and install something you did not want or need, you can remove the environment and create a new one.
You can export the list of packages in an environment and share it with your code. This makes replicating your results easier (see the example commands after this list).
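As a sketch of what this looks like in practice with conda (the environment name old-project is a placeholder):
$ conda env list                        # list all your environments
$ conda env remove --name old-project   # delete an environment you no longer need
$ conda env export > environment.yml    # export the packages of the active environment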
Exercises 2
Dependencies-2: Create a conda environment (15 min)
Chloe just joined your team and will be working on her Master's thesis. She is quite familiar with Python but is still finishing some Python assignments (due in a few weeks), and you give her a Python code for analyzing and plotting your favorite data. The catch is that your Python code was developed by another Master's student (from last year) and requires an older version of Numpy (1.24.3) and Matplotlib (3.7.2); otherwise the code fails. The code could probably work with a recent version of Python, but it has been validated with Python 3.10 only. Having no idea what the code does, she decides that the best approach is to create an isolated environment with the same dependencies that were used previously. This will give her a baseline for future upgrades and development.
For this first exercise, we will be using conda for creating an isolated environment.
Create a conda environment:
$ conda create --name python310-env python=3.10 numpy=1.24.3 matplotlib=3.7.2
Conda environments can also be managed (created, updated, deleted) from Anaconda Navigator. Check out the corresponding documentation.
Activate the environment:
$ conda activate python310-env
conda activate versus source activate
conda activate will only work if you have run conda init in the past. Running conda init will make loading environments easier, as you will always have a conda environment loaded.
However, this can also cause problems, as programs in the main environment will be constantly loaded and might be used even when they are not supposed to be. A common example is not having pip installed in a conda environment, which results in pip from the main environment being used instead.
You can always try:
$ source activate python310-env
Open a Python console and check that you indeed have the right version of each package:
import numpy
import matplotlib
print('Numpy version: ', numpy.__version__)
print('Matplotlib version: ', matplotlib.__version__)
Or use these one-liners if you have access to a terminal like bash:
$ python -c 'import numpy; print(numpy.__version__)'
$ python -c 'import matplotlib; print(matplotlib.__version__)'
Deactivate the environment:
$ conda deactivate
Check Numpy and Matplotlib versions in the default environment to make sure they are different from python310-env.
There is no need to specify the conda environment when using deactivate. It deactivates the current environment.
Exercises 3
Dependencies-3: Create a virtualenv (15 min, optional)
This is the same exercise as before, but we use a virtual environment (venv) rather than conda.
Create a venv:
$ python3 -m venv scicomp
Here scicomp is the name of the virtual environment. It creates a new folder called scicomp.
Activate it. To activate your newly created virtual environment, locate the script called activate and source it.
Linux/macOS: look in the bin folder inside the scicomp folder:
$ source scicomp/bin/activate
Windows: you can most likely find it in the Scripts folder.
Install Numpy 1.24.3 and Matplotlib 3.7.2 into the virtual environment:
$ pip install numpy==1.24.3
$ pip install matplotlib==3.7.2
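As in the conda exercise, you can verify the installed versions from inside the activated virtual environment, for example:
$ python -c 'import numpy, matplotlib; print(numpy.__version__, matplotlib.__version__)'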
Deactivate it:
$ deactivate
Problems that might happen with manual installation
Running the install commands manually can result in unexpected behaviour such as:
The installer might remove already installed packages or update them.
The installer might not find a package that works with already installed packages.
The reason for this is that the installer does not know what commands you ran in the past. It only knows the state of your environment and what you’re currently telling it to install.
These kinds of problems can be mitigated by recording dependencies in an environment.yml or requirements.txt.
Recording dependencies
There are two standard ways to record dependencies for Python projects: requirements.txt and environment.yml.
requirements.txt (used by virtual environments) is a simple text file which looks like this:
numpy
matplotlib
pandas
scipy
environment.yml (used by conda) is a YAML file which looks like this:
name: my-environment
channels:
- defaults
dependencies:
- numpy
- matplotlib
- pandas
- scipy
If you need to recreate the exact same environment later on, it can be very useful to pin dependencies to certain versions. For example, there is usually a delay between doing research and that research being published. During this time the dependencies might update and reviewers or interested researchers might not be able to replicate your results or run your code.
Conda channels
Sometimes the package version you need does not seem to be available. You may have to select another conda channel.
The most popular channels are defaults, which is managed by Anaconda Inc., and conda-forge, which is maintained by the open source community. These two channels are mutually incompatible.
Channel priority goes from top to bottom: the channel listed first has the highest priority.
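For example, you can install from a specific channel on the command line, or inspect which channels are configured (somepackage is a placeholder name):
$ conda install --channel conda-forge somepackage   # install from the conda-forge channel
$ conda config --show channels                      # show configured channels and their priority order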
Here are the two files again, but this time with versions pinned:
requirements.txt with versions:
numpy==1.24.3
matplotlib==3.7.2
pandas==2.0.3
scipy==1.10.1
environment.yml with versions:
name: my-environment
channels:
- defaults
dependencies:
- python=3.10
- numpy=1.24.3
- matplotlib=3.7.2
- pandas=2.0.3
- scipy=1.10.1
Conda can also read and write requirements.txt.
requirements.txt can also refer to packages on GitHub.
environment.yml can also contain a pip section.
See also: https://coderefinery.github.io/reproducible-research/dependencies/
Putting too strict requirements can be counter-productive
Pinning exact version numbers can be good for single-use applications, like replicating a research paper, but it is usually bad for long-term maintenance because the program will not be updated when its requirements are.
If you’re creating a library, adding strict dependencies can also create a situation where the library cannot coexist with another library.
Dependencies 4
Dependencies-4: Freeze an environment (15 min)
- Create the file environment.yml or requirements.txt.
- Create an environment based on these dependencies:
Conda:
$ conda env create --file environment.yml
Virtual environment: First create and activate, then
$ pip install -r requirements.txt
- Freeze the environment:
Conda:
$ conda env export > environment.yml
Virtual environment:
$ pip freeze > requirements.txt
Have a look at the generated (“frozen”) file.
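A frozen conda export lists every package in the environment, including platform-specific builds, which may not be portable to another operating system. If you only want the packages you explicitly asked for, conda offers a --from-history option (a sketch):
$ conda env export --from-history > environment.yml   # export only the explicitly requested packages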
Hint: Updating packages from dependency files
Instead of installing packages with $ pip install somepackage, you can add somepackage to requirements.txt and re-run $ pip install -r requirements.txt.
With conda, you can add the package to environment.yml and run $ conda env update --file environment.yml.
How to communicate the dependencies as part of a report/thesis/publication
Each notebook, script, or project which depends on libraries should come with either a requirements.txt or an environment.yml, unless you are creating and distributing this project as a Python package (see next section).
Attach a requirements.txt or an environment.yml to your thesis.
Even better: put requirements.txt or environment.yml in your Git repository along with your code.
Even better: also binderize your analysis pipeline (more about that in a later session).
Version pinning for package creators
We will talk about packaging in a different session, but when you create a library and package projects, you express dependencies either in setup.py or pyproject.toml (PyPI) or meta.yaml (conda).
These dependencies will then be used either by other libraries (which in turn write their own setup.py or pyproject.toml or meta.yaml) or by people directly (filling out a requirements.txt or an environment.yml).
Now as a library creator you have a difficult choice. You can either pin versions very narrowly like here (example taken from setup.py):
# ...
install_requires=[
    'numpy==1.19.2',
    'matplotlib==3.3.2',
    'pandas==1.1.2',
    'scipy==1.5.2',
]
# ...
or you can define a range or keep them undefined like here (example taken from setup.py):
# ...
install_requires=[
    'numpy',
    'matplotlib',
    'pandas',
    'scipy',
]
# ...
Should we pin the versions here or not?
Pinning versions here would be good for reproducibility.
However, pinning versions may make it difficult for this library to be used in a project alongside other libraries with conflicting version dependencies.
Therefore, as a library creator, make the version requirements as wide as possible.
Set a minimum version when you know of a reason: >=2.1
Sometimes set a maximum version to the next major version (<4 when you currently use 3.x.y) if you expect issues with the next major version.
As the “end consumer” of libraries, define your dependencies as narrowly as possible.
See also
Other tools for dependency management:
Poetry: dependency management and packaging
Pipenv: dependency management, alternative to Poetry
pyenv: if you need different Python versions for different projects
micropipenv: lightweight tool to “rule them all”
mamba: a drop-in replacement for conda that performs installations faster.
miniforge: Miniconda alternative with conda-forge as the default channel and optionally mamba as the default installer.
micromamba: a tiny version of mamba as a static C++ executable. It does not need a base environment or Python for installing an environment.
Keypoints
Install dependencies by first recording them in requirements.txt or environment.yml and installing using these files; then you have a trace.
Use isolated environments and avoid installing packages system-wide.