Scripts

Questions

Why are command line programs useful, compared to Jupyter notebooks and similar?
How to create a Python script?
How to generalize a Python script?

Objectives

Learn how to streamline your Python notebooks by creating repeatable Python scripts
Learn how to import other Python files
Learn to parse command line arguments in Python

Why scripts?

So far we have been learning Python using Jupyter notebooks. It is very convenient: it allowed us to experiment and prototype Python code so we may think that is more than enough for your day to day work.

But after several weeks of hard work with Python, you may end up:

either with 10 different notebooks (so that you can run them concurrently)
or with a very long notebook which is becoming hardly readable!

Let’s imagine you have created 10 notebooks to run for 10 different input parameters and now you are willing to experiment with 1000 sets of input parameters. Suppose you find a bug in the original notebook and need to rerun everything: are you willing to re-create manually your 1000 notebooks?

In this episode, we will learn how to automate your work using Python scripts so that

you do not need to manually configure your notebooks to be able to run with different parameters
can easily run you work via other tools, such as on computing clusters.

From Jupyter notebooks to Python scripts

Save as Python script

Jupyter notebooks can be parameterized for instance using papermill. It can be an attractive approach when you have short notebooks (to generate automatically plots/reports) but as soon as you have more complex tasks to execute, we strongly recommend to generate Python scripts. This will also force you to modularize your code. See CodeRefinery’s lesson on Modular code development.

You need to convert the notebook to a Python file. Check the JupyterLab documentation for more information. You can get a command line by (File → New Launcher → Terminal - if you go through New Launcher, your command line will be in the directory you are currently browsing), you can convert files in the terminal by running:

$ jupyter nbconvert --to script your_notebook_name.ipynb

If nbconvert doesn’t work, within JupyterLab, you can export any Jupyter notebook to a Python script, but this downloads it to your own computer and then you need to copy it to a place you are working (maybe upload it back to JupyterLab?):

JupyterLab screenshot as described in caption. — Select File (top menu bar) → Export Notebook as → **Export notebook to Executable Script**.

Exercises 1

Scripts-1

Download the weather_observations.ipynb and upload them to your Jupyterlab. The script plots the temperature data for Tapiola in Espoo. The data is originally from rp5.kz and was slightly adjusted for this lecture.

Hint: Copy the URL above (right-click) and in JupyterLab, use File → Open from URL → Paste the URL. It will both download it to the directory JupyterLab is in and open it for you.
Open a terminal in Jupyter: File → New Launcher, then click “Terminal” there. (if you do it this way, it will be in the right directory. File → New → Terminal might not be.)

Convert the Jupyter script to a Python script by calling:

$ jupyter nbconvert --to script weather_observations.ipynb

Run the script (note: you may have python3 rather than python):
```
$ python weather_observations.py
```

Command line arguments with `sys.argv`

We now have a Python script that is callable from the command line (e.g. for use on an HPC system). However, this code is still not adjustable, as we still need to have a copy for each single time range we want to plot, or need to modify our file whenever we want to just change parameters. What we need is to allow the code to do something different based on something outside the code itself: in this case, to plot information for different time ranges. This can be achieved by using Pythons sys package, which provides access to arguments given to the Python interpreter at startup in the sys.argv list. The first (i.e. sys.argv[0] entry of this array is the script that is running, and any further argument (separated by space) is appended to this list, like such:

$ python my_script.py A B
$ # sys.argv[1] is 'A'
$ # sys.argv[2] is 'B'

Lets see how it works: We modify the weather_observations.py script such that we allow start and end times as well as the output file to be passed in as arguments to the function. Open it (find the .py file from the JupyterLab file browser) and make these edits:

import sys
import pandas as pd

# define the start and end time for the plot
start_date = pd.to_datetime(sys.argv[1], dayfirst=True)
end_date = pd.to_datetime(sys.argv[2], dayfirst=True)
...

# select the data
weather = weather[weather['Local time'].between(start_date,end_date)]
...

# save the figure
output_file_name = sys.argv[3]
fig.savefig(output_file_name)

We can try it out (see the file spring_in_tapiola.png made in the file browser):

$ python weather_observations.py 01/03/2021 31/05/2021 spring_in_tapiola.png

Discussion

Does it work?
Why is this better than modifying the script every time I want it to plot data for a different period?
What problems do you expect when using this approach (using sys.argv)?

This approach is brittle and more robust solutions exist that allow you to fully customize your scripts and generate help texts at the same time:

argparse: built-in to Python, this is the one that we will show below.
doctopt: you write the help text and this generates a parser for you.
click: another nice library for command line interfaces - very easy to use.

Parsing command line arguments with `argparse`

Argparse not only gives you descriptive command line arguments, it also automatically generates a --help option for you. To use argparse you first set up a parser by calling parser = argparse.ArgumentParser() and then you add arguments using parser.add_argument(args). There are two different types of arguments:

Positional arguments
Optional arguments

Positional arguments are detected by their order, while optional arguments need to be given with their respective flags ( like --name or -n). The following example would parse a positional argument Name of type string and an optional argument date of type string which defaults to 01/01/2000.

import argparse

parser = argparse.ArgumentParser()
# One positional and one optional argument
parser.add_argument('name', type=str, metavar="N",
                    help="The name of the subject")
parser.add_argument('-d', '--date', type=string, default="01/01/2000",
                    help="Birth date of the subject")

args = parser.parse_args()

print(args.name + " was born on " + args.date)

If this code was in birthday.py and we would call python birthday.py --help it would show the following message:

$ python birthday.py --help
usage: birthday.py [-h] [-d DATE] N

positional arguments:
  N                     The name of the subject

optional arguments:
  -h, --help            show this help message and exit
  -d DATE, --date DATE  Birth date of the subject

Exercises 2

Scripts-2

Take the Python script (weather_observations.py) we have written in the preceding exercise and use argparse to specify the input (URL) and output files and allow the start and end dates to be set.
- Hint: try not to do it all at once, but add one or two arguments, test, then add more, and so on.
- Hint: The input and output filenames make sense as positional arguments, since they must always be given. Input is usually first, then output.
- Hint: The start and end dates should be optional parameters with the defaults as they are in the current script.
Execute your script for a few different time intervals (e.g. from January 2019 to June 2020, or from May 2020 to October 2020). Also try using this data for Cairo: https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv

Solution

import pandas as pd
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("input", type=str, help="Input data file")
parser.add_argument("output", type=str, help="Output plot file")
parser.add_argument("-s", "--start", default="01/01/2019", type=str, help="Start date in DD/MM/YYYY format")
parser.add_argument("-e", "--end", default="16/10/2021", type=str, help="End date in DD/MM/YYYY format")

args = parser.parse_args()

# load the data
weather = pd.read_csv(args.input,comment='#')

# define the start and end time for the plot
start_date=pd.to_datetime(args.start, dayfirst=True)
end_date=pd.to_datetime(args.end, dayfirst=True)

# preprocess the data
weather['Local time'] = pd.to_datetime(weather['Local time'], dayfirst=True)
# select the data
weather = weather[weather['Local time'].between(start_date,end_date)]

# plot the data
import matplotlib.pyplot as plt
# start the figure.
fig, ax = plt.subplots()
ax.plot(weather['Local time'], weather['T'])
# label the axes
ax.set_xlabel("Date of observation")
ax.set_ylabel("Temperature in Celsius")
ax.set_title("Temperature Observations")
# adjust the date labels, so that they look nicer
fig.autofmt_xdate()

# save the figure
fig.savefig(args.output)

Discussion

What was the point of doing this?

Now you can do this:

$ python weather_observations.py --help
$ python weather_observations.py https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv temperature_tapiola.png
$ python weather_observations.py -s 1/12/2020 -e 31/12/2020 https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv temperature_tapiola_dec.png
$ python weather_observations.py -s 1/2/2021 -e 28/2/2021 https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_tapiola.csv temperature_tapiola_feb.png
$ python weather_observations.py https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv temperature_cairo.png

We can now process different input files without changing the script.
We can select multiple time ranges without modifying the script.
We can easily save these commands to know what we did.
This way we can also loop over file patterns (using shell loops or similar) or use the script in a workflow management system and process many files in parallel.
By changing from sys.argv to argparse we made the script more robust against user input errors and also got a help text (accessible via --help).

Load larger option lists using config files

In the above example we only allowed the input and output files along with start and end dates to be selected by command line arguments. This already leads to a quite large command line call. Now imagine, that we also want to allow the user to select more specific information from the dataset, define specific X and Y labels, write their own title etc. Now imagine to put all this into the command line:

$ python weather_observations.py --input https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv --output rain_in_tapiola.png --xlabel "Days in June" --ylabel "Rainfall in mm" --title "Rainfall in Cairo" --data_column RRR --start 01/06/2021 --end 30/06/2021

This is an even larger line, needs scrolling and becomes quite inconvenient to modify. Instead of putting all of this into the command line, you could think about storing and modifying the arguments in a config file. There are several ways, how config files can be stored. You can use a simple Parameter = Value format, and parse it yourself, or you can use e.g. the JSON or YAML formats. For both parsers exist that can save you some work, and both formats also allow you to use more complex input data, like lists, or dictionaries. We won’t go into the details of the formats, and will only give a short example using YAML here.

The YAML file format can be simple or very complex allowing a large variety of data structures to be stored. One benefit of YAML is that there is already a Python module (yaml) available for parsing it and it directly parses numbers as numbers and text as strings, making conversions unnecessary (the same is true for JSON with the json package).

The Python module optionsparser.py provides a simple parser for YAML styled options files. Similar to argparse, it takes a dict of required options, along with a dict of optional parameters. Required arguments need to specify a type. Optional argument types are derived from their default values.

In our example above, we could for example add optional parameters that allow the selection of other weather data from the dataset (precipitation …), set the labels and titles explicitly etc.

In the YAML format, names and values are separated by :. Our above example would therefore translate to the following YAML file:

input:        https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv
output:       rain_in_cairo.png
xlabel:       Days in June
ylabel:       Rainfall in mm
title:        Rainfall in Cairo
data_column:  RRR
start:        01/06/2021
end:          30/06/2021

Exercises 3 (optional)

Scripts-3

Download the optionsparser.py function and load it into your working folder in Jupyterlab (Hint: in JupyterLab, File → Open from URL). Modify the previous script to use a config file parser to read all arguments. The config file is passed in as a single argument on the command line (using e.g. argparse or sys.argv) still needs to be read from the command line.
Run your script with different config files.

Solution

The modified weather_observations.py script:

#!/usr/bin/env python
# coding: utf-8

import pandas as pd
from optionsparser import get_parameters
import argparse 

# Lets start reading our confg file. we'll use argparse to get the config file.
parser = argparse.ArgumentParser()
parser.add_argument('input', type=str,
                    help="Config File name ")
args = parser.parse_args()

# Set optional parameters with default values and required parameter values with their type
defaults = {
           "xlabel"      : "Date of observation",
           "title"       : "Weather Observations",
           "start"       : "01/06/2021",
           "end"         : "01/10/2021",
           "output"      : "weather.png",
           "ylabel"      : "Temperature in Celsius",
           "data_column" : "T",
           }

required = {
           "input"  : str
           }
           
# now, parse the config file
parameters = get_parameters(args.input, required, defaults)

# load the data
weather = pd.read_csv(parameters.input,comment='#')

# obtain start and end date
start_date=pd.to_datetime(parameters.start, dayfirst=True)
end_date=pd.to_datetime(parameters.end, dayfirst=True)

# Data preprocessing
weather['Local time'] = pd.to_datetime(weather['Local time'], dayfirst=True)
# select the data
weather = weather[weather['Local time'].between(start_date,end_date)]

# Data plotting
import matplotlib.pyplot as plt
# start the figure.
fig, ax = plt.subplots()
ax.plot(weather['Local time'], weather['T'])
# label the axes
ax.set_xlabel("Date of observation")
ax.set_ylabel("Temperature in Celsius")
ax.set_title("Temperature Observations")
# adjust the date labels, so that they look nicer
fig.autofmt_xdate()


# save the figure
fig.savefig(parameters.output)

What did this config file parser get us? Now, we have separated the code from the configuration. We could save all the configuration in version control - separately and have one script that runs them. If done right, our work could be much more reproducible and understandable.