List of exercises

Full list

This is a list of all exercises and solutions in this lesson, mainly as a reference for helpers and instructors. It is automatically generated from the other pages in the lesson. Any single teaching event will probably cover only a subset of these, depending on the audience's interests.

Jupyter

In jupyter.ipynb:

Exercises: Jupyter-1

  1. Start Jupyter in the directory you want to use for this course.

    • If you are starting from the navigator, change to the directory you want to use.

    • If you are starting from the command line, you should navigate to the directory you want to use first.

  2. Create a Python 3 notebook file. Save it. In the next section, you will add stuff to it.

  3. (optional, but will be done in future lessons) Explore the file browser, try making some non-notebook text/py/md files and get used to that.

  4. (optional, advanced) Look at the notebook file in a text editor. How does it work?

If everything works for you, this will end very quickly. You can begin reading the next sections independently.

In jupyter.ipynb:

Exercises: Jupyter-2

  1. Run some trivial code, such as print(1).

  2. Run some slightly less trivial code, such as printing the first ten Fibonacci numbers.

  3. Make a Markdown cell above your code cell and give it a title and some description of your function. Use a Markdown reference to add a heading, a bullet list, and some formatting (bold, italic, or inline code).

  4. Use the %%timeit magic function to time your Fibonacci function.

  5. Again using %%timeit, figure out the fastest way to sum the numbers 0 to 1000000.

  6. Once you are done, close your notebooks and other tabs you don’t need. Check the running sessions (hint: the thin left sidebar) and shut down the kernels you no longer need.
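
For reference, a minimal sketch of items 2, 4 and 5 above (the Fibonacci implementation is just one possibility, and each %%timeit must sit at the top of its own cell):

    # one cell: print the first ten Fibonacci numbers
    def fibonacci(n):
        """Return a list with the first n Fibonacci numbers."""
        numbers = [0, 1]
        while len(numbers) < n:
            numbers.append(numbers[-1] + numbers[-2])
        return numbers[:n]

    print(fibonacci(10))

    %%timeit
    # separate cell: time the Fibonacci function
    fibonacci(10)

    %%timeit
    # separate cell: one candidate for summing 0..1000000 -- compare with alternatives
    sum(range(1_000_001))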

In jupyter.ipynb:

Exercises: Jupyter-3

(optional) Discuss the following in groups:

  1. Have any of you used Jupyter in a way that became impossible to maintain: too many files, code spread all over, not being able to find your code and run it in the right order? How did you solve that?

  2. On the other hand, what are your successes with Jupyter?

  3. How can you prevent these problems by better development strategies?

NumPy

In numpy.rst:

Exercises: Numpy-1

  1. Datatypes: Try out np.arange(10) and np.linspace(0, 9, 10). What is the difference? Can you adjust one to do the same as the other?

  2. Datatypes: Create a 3x2 array of random floats (check numpy.random.random()) between 0 and 1. Now change the array's datatype to int (array.astype). What does the array look like?

  3. Reshape: Create a 3x2 array of random integers between 0 and 10. Change the shape of the array (check array.reshape) in any way possible. What is not possible?

  4. NumPy I/O: Save the above array to a .npy file (numpy.save()) and read it back in. A sketch of all four steps follows below.
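
A minimal sketch of the four steps above (the file name is just an example):

    import numpy as np

    # 1. arange gives the integers 0..9, linspace gives 10 evenly spaced floats from 0 to 9
    a = np.arange(10)
    b = np.linspace(0, 9, 10)

    # 2. random floats in [0, 1); casting to int truncates everything below 1 to 0
    c = np.random.random((3, 2))
    c_int = c.astype(int)

    # 3. reshaping only works if the total number of elements stays the same
    d = np.random.randint(0, 10, size=(3, 2))
    d_reshaped = d.reshape(2, 3)   # fine: still 6 elements
    # d.reshape(4, 2) would raise a ValueError

    # 4. save to a .npy file and read it back
    np.save('my_array.npy', d)
    d_loaded = np.load('my_array.npy')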

In numpy.rst:

Exercises: Numpy-2

In numpy.rst:

Exercise: Numpy-3

a = np.eye(4)
b = a[:,0]
b[0] = 5
  • View vs copy: Try out the above code. What does a look like before b is changed, and after? How could this be avoided?
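
A sketch of what happens and of one way to avoid it:

    import numpy as np

    a = np.eye(4)
    b = a[:, 0]          # b is a view: it shares memory with a
    b[0] = 5
    print(a[0, 0])       # 5.0 -- changing b also changed a

    a = np.eye(4)
    b = a[:, 0].copy()   # an explicit copy avoids the surprise
    b[0] = 5
    print(a[0, 0])       # still 1.0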

In numpy.rst:

Exercises: Numpy-4

  • In-place addition: Create an array, add it to itself using a ufunc.

  • In-place addition (advanced): Create an array of dtype='float', and an array of dtype='int'. Try to use the int array as the output argument when adding the first two arrays.

  • Output arguments and timing: Repeat the initial b = a ** 2 example using the output argument and time it. Can you make it even faster using the output argument? A sketch follows below.
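
A sketch of these bullets (array sizes are arbitrary; time the last two lines with %timeit yourself):

    import numpy as np

    # in-place addition with a ufunc: the out argument reuses the existing memory
    a = np.arange(5, dtype=float)
    np.add(a, a, out=a)            # a is now [0., 2., 4., 6., 8.]

    # using an int array as out for float inputs raises an error about casting rules
    f = np.ones(5, dtype='float')
    i = np.ones(5, dtype='int')
    # np.add(f, f, out=i)          # error: cannot cast float64 output to int64

    # b = a ** 2 with an output argument
    a = np.random.random(1_000_000)
    b = np.empty_like(a)
    np.power(a, 2, out=b)          # same values as b = a ** 2, without a new allocation
    np.multiply(a, a, out=b)       # often faster still than raising to a power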

In numpy.rst:

Numpy-5

If you have extra time, try these out. These are advanced and optional, and will not be done in most courses.

  1. Reverse a vector. Given a vector, reverse it such that the last element becomes the first, e.g. [1, 2, 3] => [3, 2, 1]

  2. Create a 2D array with zeros on the borders and 1 inside.

  3. Create a random array with elements [0, 1), then add 10 to all elements in the range [0.2, 0.7).

  4. What is np.round(0.5)? What is np.round(1.5)? Why?

  5. In addition to np.round, explore numpy.ceil, numpy.floor, numpy.trunc. In particular, take note of how they behave with negative numbers.

  6. Recall the identity sin^2(x) + cos^2(x) = 1. Create a random 4x4 array with values in the range [0, 10). Now test the equality with numpy.equal. What result do you get with numpy.allclose() instead of np.equal?

  7. Create a 1D array with 10 random elements. Sort it.

  8. What’s the difference between np_array.sort() and np.sort(np_array)?

  9. For the random array above, instead of sorting it, perform an indirect sort. That is, return the list of indices which would index the array in sorted order.

  10. Create a 4x4 array of zeros, and another 4x4 array of ones. Next combine them into a single 8x4 array with the content of the zeros array on top and the ones on the bottom. Finally, do the same, but create a 4x8 array with the zeros on the left and the ones on the right.

  11. NumPy functionality: Create two 2D arrays and do matrix multiplication first manually (with a for loop), then using the np.dot function. Use %%timeit to compare execution times. What is happening?
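
Sketches for a few of the items above (1, 2, 7–10); the rest follow the same pattern:

    import numpy as np

    v = np.array([1, 2, 3])
    reversed_v = v[::-1]            # item 1: [3, 2, 1]

    grid = np.zeros((5, 5))         # item 2: zeros on the border, ones inside
    grid[1:-1, 1:-1] = 1

    r = np.random.random(10)        # items 7-9
    r_sorted = np.sort(r)           # returns a sorted copy; r.sort() sorts in place
    order = np.argsort(r)           # indirect sort: indices that would sort r

    z = np.zeros((4, 4))            # item 10: stacking
    o = np.ones((4, 4))
    tall = np.vstack((z, o))        # 8x4, zeros on top
    wide = np.hstack((z, o))        # 4x8, zeros on the left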

Advanced NumPy

In numpy-advanced.rst:

Exercises: Numpy-Advanced-1

Write a Python script that uses NumPy to generate 100 million (100000000) random numbers and add them all together. Time how long it takes to execute. Can you beat the C version?

If you are having trouble with this, we recommend completing the basic NumPy lesson before continuing with this advanced lesson. If you are taking a live course, don’t worry: watch, learn, and explore some during the exercises!
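
A minimal sketch of one way to do and time this (how you time it is up to you):

    import time
    import numpy as np

    n = 100_000_000

    start = time.time()
    numbers = np.random.rand(n)   # 100 million random numbers
    total = numbers.sum()         # add them all together
    print(f"sum = {total}, elapsed = {time.time() - start:.2f} s")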

In numpy-advanced.rst:

Exercises: Numpy-Advanced-2

Write a function called ravel() that takes the row and column of an element in a 2D matrix and produces the appropriate index in a 1D array, where all the rows are concatenated. See the image above to remind yourself how each row of the 2D matrix ends up in the 1D array.

The function takes these inputs:

  • row The row of the requested element in the matrix as integer index.

  • col The column of the requested element in the matrix as integer index.

  • n_rows The total number of rows of the matrix.

  • n_cols The total number of columns of the matrix.

Here are some examples of input and desired output:

  • ravel(2, 3, n_rows=4, n_cols=4) → 11

  • ravel(2, 3, n_rows=4, n_cols=8) → 19

  • ravel(0, 0, n_rows=1, n_cols=1) → 0

  • ravel(3, 3, n_rows=4, n_cols=4) → 15

  • ravel(3_465, 18_923, n_rows=10_000, n_cols=20_000) → 69_318_923
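
If you want to check your answer: because the rows are concatenated in order, each full row contributes n_cols elements before the requested row starts, so one possible implementation is:

    def ravel(row, col, n_rows, n_cols):
        """Index into the flattened (row-major) 1D array."""
        return row * n_cols + col

    print(ravel(2, 3, n_rows=4, n_cols=4))                      # 11
    print(ravel(3_465, 18_923, n_rows=10_000, n_cols=20_000))   # 69_318_923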

In numpy-advanced.rst:

Exercises: Numpy-Advanced-3

A little-known feature of NumPy is the numpy.lib.stride_tricks module that allows you to modify the strides attribute directly. Playing around with this is very educational.

  1. Create your own transpose() function that will transpose a 2D matrix by reversing its shape and strides attributes using numpy.lib.stride_tricks.as_strided().

  2. Create a (5 ✕ 100 000 000 000) array containing on the first row all 1’s, on the second row all 2’s, and so on. Start with a 1D array a = np.array([1., 2., 3., 4., 5.]) and modify its shape and strides attributes using numpy.lib.stride_tricks.as_strided() to obtain the desired 2D matrix:

    array([[1., 1., 1., ..., 1., 1., 1.],
           [2., 2., 2., ..., 2., 2., 2.],
           [3., 3., 3., ..., 3., 3., 3.],
           [4., 4., 4., ..., 4., 4., 4.],
           [5., 5., 5., ..., 5., 5., 5.]])
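
A sketch of both parts; in the second one the huge array is never materialized, because a stride of zero along the second axis keeps re-reading the same element:

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    def transpose(matrix):
        """Transpose a 2D array by reversing its shape and strides."""
        return as_strided(matrix, shape=matrix.shape[::-1],
                          strides=matrix.strides[::-1])

    a = np.array([1., 2., 3., 4., 5.])
    big = as_strided(a, shape=(5, 100_000_000_000), strides=(a.itemsize, 0))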
    

Pandas

In pandas.rst:

Exploring dataframes

  • Have a look at the available methods and attributes using the API reference or the autocomplete feature in Jupyter.

  • Try out a few methods using the Titanic dataset and have a look at the docstrings (help pages) of methods that pique your interest

  • Compute the mean age of the first 10 passengers by slicing and the pandas.DataFrame.mean() method

  • (Advanced) Using boolean indexing, compute the survival rate (mean of “Survived” values) among passengers over and under the average age.
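
A sketch of the last two bullets, assuming the Titanic data has been read into a DataFrame called titanic (the file name below is a placeholder):

    import pandas as pd

    titanic = pd.read_csv("titanic.csv")   # load the data however the lesson does

    # mean age of the first 10 passengers, by slicing
    print(titanic.iloc[:10]["Age"].mean())

    # (advanced) survival rate among passengers above and below the average age
    avg_age = titanic["Age"].mean()
    print(titanic[titanic["Age"] > avg_age]["Survived"].mean())
    print(titanic[titanic["Age"] <= avg_age]["Survived"].mean())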

In pandas.rst:

Analyze the Titanic passenger list dataset

In the Titanic passenger list dataset, investigate the family size of the passengers (i.e. the “SibSp” column).

  • What different family sizes exist in the passenger list? Hint: try the unique() method

  • What are the names of the people in the largest family group?

  • (Advanced) Create histograms showing the distribution of family sizes for passengers split by the fare, i.e. one group of high-fare passengers (where the fare is above average) and one for low-fare passengers (Hint: instead of an existing column name, you can give a lambda function as a parameter to hist() to compute a value on the fly. For example lambda x: "Poor" if df["Fare"].loc[x] < df["Fare"].mean() else "Rich").
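
Sketches for the first two bullets, again assuming the DataFrame is called titanic and has the standard "Name" column:

    # which family sizes occur in the passenger list
    print(titanic["SibSp"].unique())

    # names of the people in the largest family group
    largest = titanic["SibSp"].max()
    print(titanic[titanic["SibSp"] == largest]["Name"])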

In pandas.rst:

Analyze the Nobel prize dataset

  • Which country has received the largest number of Nobel prizes, and how many? How many countries are represented in the dataset? Hint: use the describe method on the bornCountryCode column (see the sketch after these bullets).

  • Create a histogram of the age when the laureates received their Nobel prizes. Hint: follow the above steps we performed for the lifespan.

  • List all the Nobel laureates from your country.
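
For the first bullet, describe() on an object column already answers both questions at once (assuming the data is loaded in a DataFrame called nobel, as in the snippets further down):

    # 'unique' = number of countries, 'top' = most frequent country, 'freq' = its count
    print(nobel["bornCountryCode"].describe())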

Now more advanced steps:

  • Now define an array of 4 countries of your choice and extract only laureates from these countries (you need to look at the data and find how countries are written, and replace COUNTRY with those strings):

    countries = np.array([COUNTRY1, COUNTRY2, COUNTRY3, COUNTRY4])
    subset = nobel.loc[nobel['bornCountry'].isin(countries)]
    
  • Use groupby() to compute how many Nobel prizes each country received in each category. The size() method tells us how many rows, and hence Nobel prizes, are in each group:

    nobel.groupby(['bornCountry', 'category']).size()
    
  • (Optional) Create a pivot table to get a spreadsheet-like structure, and view it

    • First add a column “number” to the subset dataframe containing 1’s (to enable the counting below). We need to make a copy of subset, because right now it is only a view:

      subset = subset.copy()
      subset.loc[:, 'number'] = 1
      
    • Then create the pivot_table():

      table = subset.pivot_table(values="number", index="bornCountry", columns="category", aggfunc=np.sum)
      
  • (Optional) Install the seaborn visualization library if you don’t already have it, and create a heatmap of your table:

    import seaborn as sns
    sns.heatmap(table,linewidths=.5);
    
  • Play around with other nice looking plots:

    sns.violinplot(y=subset["year"].dt.year, x="bornCountry", inner="stick", data=subset);
    
    sns.swarmplot(y="year", x="bornCountry", data=subset, alpha=.5);
    
    subset_physchem = nobel.loc[nobel['bornCountry'].isin(countries) & (nobel['category'].isin(['physics']) | nobel['category'].isin(['chemistry']))]
    sns.catplot(x="bornCountry", y="year", col="category", data=subset_physchem, kind="swarm");
    
    sns.catplot(x="bornCountry", col="category", data=subset_physchem, kind="count");
    

Data visualization with Matplotlib

In data-visualization.md:

Exercise Matplotlib-1: extend the previous example (15 min)

  • Extend the previous plot by also plotting this set of values but this time using a different color (#56B4E9):

    # this is dataset 2
    data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    
  • Then add another color (#009E73) which plots the second dataset, scaled by 2.0.

    # here we multiply all elements of data2_y by 2.0
    data2_y_scaled = [y * 2.0 for y in data2_y]
    
  • Try to add a legend to the plot with matplotlib.axes.Axes.legend(), searching the web for clues on how to add labels to each dataset. You can also consult this great quick start guide.

  • At the end it should look like this one:

    Result of the exercise
  • Experiment also by using named colors (e.g. “red”) instead of the hex-codes.
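
A sketch of one possible solution, assuming the x values and the first dataset from the previous example are still available as data_x and data_y (adapt the names if yours differ):

    import matplotlib.pyplot as plt

    data2_y_scaled = [y * 2.0 for y in data2_y]

    fig, ax = plt.subplots()
    ax.scatter(data_x, data_y, label="data 1")
    ax.scatter(data_x, data2_y, color="#56B4E9", label="data 2")
    ax.scatter(data_x, data2_y_scaled, color="#009E73", label="data 2 scaled")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()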

In data-visualization.md:

Exercise Customization-1: log scale in Matplotlib (15 min)

In this exercise we will learn how to use log scales.

  • To demonstrate this we first fetch some data to plot:

    import pandas as pd
    
    url = (
        "https://raw.githubusercontent.com/plotly/datasets/master/gapminder_with_codes.csv"
    )
    gapminder_data = pd.read_csv(url).query("year == 2007")
    
    gapminder_data
    
  • Try the above snippet in a notebook; it will give you an overview of the data.

  • Then we can plot the data, first using a linear scale:

    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots()
    
    ax.scatter(x=gapminder_data["gdpPercap"], y=gapminder_data["lifeExp"], alpha=0.5)
    
    ax.set_xlabel("GDP per capita (PPP dollars)")
    ax.set_ylabel("Life expectancy (years)")
    

    This is the result but we realize that a linear scale is not ideal here:

    Gapminder data plotted using a linear scale
  • Your task is to switch to a log scale and arrive at this result:

    Gapminder data plotted using log scale
  • What does alpha=0.5 do?
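
One way to get there is to change only the x axis to a log scale, since GDP per capita spans several orders of magnitude:

    fig, ax = plt.subplots()

    ax.scatter(x=gapminder_data["gdpPercap"], y=gapminder_data["lifeExp"], alpha=0.5)

    ax.set_xscale("log")

    ax.set_xlabel("GDP per capita (PPP dollars)")
    ax.set_ylabel("Life expectancy (years)")

As for the last bullet: alpha=0.5 makes the markers semi-transparent, so regions where many points overlap appear darker.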

In data-visualization.md:

Exercise Customization-2: preparing a plot for publication (15 min)

Often we need to create figures for presentation slides and for publications, but the two have different requirements: for presentation slides you have the whole screen, but for a figure in a publication you may only have a few centimeters/inches.

For figures that go to print, it is good practice to look at them at the size at which they will be printed; often the fonts and tick marks are then too small.

Your task is to make the tick marks and the axis label font larger, using the Matplotlib “parts of a figure” documentation and a web search, and to arrive at this:

Gapminder data plotted with larger font and larger ticks
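
One possible approach, continuing from the previous exercise (the exact sizes are a matter of taste):

    fig, ax = plt.subplots()

    ax.scatter(x=gapminder_data["gdpPercap"], y=gapminder_data["lifeExp"], alpha=0.5)
    ax.set_xscale("log")

    # larger axis labels, larger tick labels and longer tick marks
    ax.set_xlabel("GDP per capita (PPP dollars)", fontsize=15)
    ax.set_ylabel("Life expectancy (years)", fontsize=15)
    ax.tick_params(which="both", labelsize=15, length=10)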

Data formats with Pandas and Numpy

In data-formats.rst:

Exercise

  • Create an arbitrary Python object (for example, a string or a list). Pickle it.

    Read the pickled object back in and check if it matches the original one.
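
A minimal sketch:

    import pickle

    data = ["apple", "banana", 42]

    with open("data.pickle", "wb") as f:     # pickle the object to a file
        pickle.dump(data, f)

    with open("data.pickle", "rb") as f:     # read it back in
        data_loaded = pickle.load(f)

    print(data_loaded == data)               # True if the round trip preserved it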

In data-formats.rst:

Exercise

  • Create the example dataset:

    import pandas as pd
    import numpy as np
    
    n_rows = 100000
    
    dataset = pd.DataFrame(
        data={
            'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
            'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
            'integer': np.random.choice(range(0,10), size=n_rows),
            'float': np.random.uniform(size=n_rows),
        },
    )
    
  • Save the dataset dataset as CSV. Load the dataset into a variable dataset_csv.

  • Use dataset.compare(dataset_csv) to check whether the loaded dataset matches the original one.
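
A sketch of the round trip. compare() only lists differences, so an empty result means the datasets match; whether they do depends on how the index and the timestamp column survive the CSV round trip:

    dataset.to_csv("dataset.csv", index=False)
    dataset_csv = pd.read_csv("dataset.csv", parse_dates=["timestamp"])

    # an empty DataFrame here means no differences were found
    print(dataset.compare(dataset_csv))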

In data-formats.rst:

Exercise

  • Create an example numpy array:

    n = 1000
    
    data_array = np.random.uniform(size=(n,n))
    
  • Store the array as a .npy file.

  • Read the array back in and compare it to the original one. Does the data match?
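
A sketch (np.save adds the .npy suffix automatically if it is missing):

    np.save("data_array.npy", data_array)
    data_array_npy = np.load("data_array.npy")

    # an exact match is expected: .npy stores the binary values losslessly
    print(np.array_equal(data_array, data_array_npy))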

Productivity tools

In productivity.rst:

Using Pylint

The following code uses scikit-learn to fit a simple linear model to randomly generated data with some error. You can download it here (see above for how to easily download and run in JupyterLab).

It has four mistakes in it. One of these cannot be found by Pylint.

Fix the following code with Pylint and try to determine why Pylint did not find the last mistake.

"""
pylint exercise 1
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model


def f(x):
    """
    Example function:

    f(x) = x/2 + 2
    """"
    return 0.5*x + 2


# Create example data
x_data = np.linspace(0, 10, 100)
err = 2 * np.random.random(x_data.shape[0])
y_data = f(x_data) + err

# Put data into dataframe
df = pd.DataFrame({'x': x_data, 'y': y_data})

# Create linear model and fit data
reg = linear_model.LinearRegression(fit_intercept=True)

reg.fit(df[['x'], df[['y']])

slope = reg.coef_[0][0]
intercept = reg.intercept_[0]

df['pred'] = reg.predict(df[['x']])

fig, ax = plt.subplots()

ax.scater(df[['x']], df[['y']], alpha=0.5)
ax.plot(df[['x']], df[['pred']]
        color='black', linestyle='--',
        label=f'Prediction with slope {slope:.2f} and intercept {intercept:.2f}')
ax.set_ylabel('y')
ax.set_xlabel('x')
ax.legend()

plt.show()

In productivity.rst:

Using black to format code

Format this code with black:

import numpy as np
import matplotlib.pyplot  as plt

def dice_toss(n,m):

    """Throw n dice m times and the total value together."""
    dice_rolls    = np.random.randint(1,6,size=(m, n))

    roll_averages = np.sum(dice_rolls,axis = -1)

    return roll_averages
fig,ax = plt.subplots( )

n = int( input('Number of dices to toss:\n'))

bins = np.arange(1, 6 * n+1)

m = 1000

ax.hist(dice_toss(n,m), bins = bins)

ax.set_title(f'Histogram of {n} dice tosses')

ax.set_xlabel('Total value' )

ax.set_ylabel('Number of instances')

plt.show()

Scripts

In scripts.rst:

Scripts-1

  1. Download the weather_observations.ipynb notebook and upload it to your JupyterLab. The notebook plots the temperature data for Tapiola in Espoo. The data is originally from rp5.kz and was slightly adjusted for this lecture.

    Hint: Copy the URL above (right-click) and in JupyterLab, use File → Open from URL → Paste the URL. It will both download it to the directory JupyterLab is in and open it for you.

  2. Open a terminal in Jupyter: File → New Launcher, then click “Terminal” there. (If you do it this way, it will be in the right directory; File → New → Terminal might not be.)

  3. Convert the notebook to a Python script by calling:

    $ jupyter nbconvert --to script weather_observations.ipynb
    
  4. Run the script (note: you may have python3 rather than python):

    $ python weather_observations.py
    

In scripts.rst:

Scripts-2

  1. Take the Python script (weather_observations.py) we have written in the preceding exercise and use argparse to specify the input (URL) and output files and allow the start and end dates to be set.

    • Hint: try not to do it all at once, but add one or two arguments, test, then add more, and so on.

    • Hint: The input and output filenames make sense as positional arguments, since they must always be given. Input is usually first, then output.

    • Hint: The start and end dates should be optional parameters with the defaults as they are in the current script.

  2. Execute your script for a few different time intervals (e.g. from January 2019 to June 2020, or from May 2020 to October 2020). Also try using this data for Cairo: https://raw.githubusercontent.com/AaltoSciComp/python-for-scicomp/master/resources/data/scripts/weather_cairo.csv
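
A minimal argparse skeleton for this kind of interface; the option names are assumptions, and the defaults should be the values currently hard-coded in your script:

    import argparse

    parser = argparse.ArgumentParser(description="Plot weather observations.")
    parser.add_argument("input", help="input data file or URL")
    parser.add_argument("output", help="output plot file")
    parser.add_argument("--start", default=None,
                        help="start date (defaults to the value used in the script)")
    parser.add_argument("--end", default=None,
                        help="end date (defaults to the value used in the script)")
    args = parser.parse_args()

    print(args.input, args.output, args.start, args.end)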

In scripts.rst:

Scripts-3

  1. Download the optionsparser.py file and load it into your working folder in JupyterLab (Hint: in JupyterLab, File → Open from URL). Modify the previous script to use this config file parser to read all arguments. The name of the config file itself still needs to be read from the command line as a single argument (using e.g. argparse or sys.argv).

  2. Run your script with different config files.

SciPy

In scipy.rst:

Exercise

Do the following exercise or read the documentation and understand the relevant functions of SciPy:

Define a function of one variable and, using scipy.integrate.quad, calculate its integral over the interval [0.0, 4.0]. Then vary the interval, modify the function, and check whether SciPy can still integrate it.
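
A minimal sketch:

    from scipy.integrate import quad

    def f(x):
        return x ** 2

    # quad returns the integral estimate and an estimate of the absolute error
    result, error = quad(f, 0.0, 4.0)
    print(result, error)   # about 21.33 for x**2 on [0.0, 4.0]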

In scipy.rst:

Exercise

Do the following exercise or read the documentation and understand the relevant functions of SciPy:

Use the SciPy sparse matrix functionality to create a random sparse matrix with a probability of non-zero elements of 0.05 and size 10000 x 10000. Then use the SciPy sparse linear algebra support to calculate the matrix-vector product of the sparse matrix you just created and a random vector. Use the %timeit magic to measure how long it takes. Does the optional format argument you can give when creating the sparse matrix make a difference?

Then compare how long it takes if you instead first convert the sparse matrix to a normal NumPy dense array and use the NumPy dot method to calculate the matrix-vector product.

Can you figure out a quick rule of thumb when it’s worth using a sparse matrix representation vs. a dense representation?
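
A sketch of the comparison (run the commented %timeit lines in IPython/Jupyter; density and size as stated in the exercise):

    import numpy as np
    import scipy.sparse

    n = 10000
    sparse_mat = scipy.sparse.random(n, n, density=0.05, format="csr")
    vector = np.random.random(n)

    # %timeit sparse_mat @ vector          # sparse matrix-vector product

    dense_mat = sparse_mat.toarray()       # same matrix as a dense NumPy array
    # %timeit dense_mat.dot(vector)        # dense matrix-vector product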

Library ecosystem

In libraries.rst:

Libraries 1.1: Libraries in your work

What libraries do you use in your work? What have you made that you could have reused from some other source? What have you used from some other source that you wish you had re-created?

Discuss in your groups or HackMD.

In libraries.rst:

Libraries 1.2: Evaluating packages

Below are some links to some packages, both public and made by the authors of this lesson. Evaluate them, considering “would I use this in my project?”

  1. https://github.com/networkx/networkx/

  2. some code on webpage in a paper’s footnote

  3. https://github.com/rkdarst/pcd

  4. https://github.com/dftlibs/numgrid

  5. https://github.com/rkdarst/dynbench

  6. https://vpython.org/

Dependency management

In dependencies.rst:

Dependencies-1: Discuss dependency management (5 min)

Please discuss and answer via collaborative document the following questions:

  • How do you install Python packages (libraries) that you use in your work? From PyPI using pip? From other places using pip? Using conda?

  • How do you track/record the dependencies? Do you write them into a file or README? Into requirements.txt or environment.yml?

  • If you track dependencies in a file, why do you do this?

  • Have you ever experienced that a project needed a different version of a Python library than the one on your computer? If yes, how did you solve it?

In dependencies.rst:

Dependencies-2: Create a conda environment (15 min)

Chloe just joined your team and will be working on her Master's thesis. She is quite familiar with Python but is still finishing some Python assignments (due in a few weeks), and you give her Python code for analyzing and plotting your favorite data. The thing is that your Python code was developed by another Master's student (from last year) and requires an older version of NumPy (1.24.3) and Matplotlib (3.7.2) (otherwise the code fails). The code could probably work with a recent version of Python but has been validated with Python 3.10 only. Having no idea what the code does, she decides that the best approach is to create an isolated environment with the same dependencies that were used previously. This will give her a baseline for future upgrades and development.

For this first exercise, we will be using conda for creating an isolated environment.

  1. Create a conda environment:

    $ conda create --name python310-env python=3.10 numpy=1.24.3 matplotlib=3.7.2
    

Conda environments can also be managed (create, update, delete) from the anaconda-navigator. Check out the corresponding documentation here.

  2. Activate the environment:

    $ conda activate python310-env
    

    conda activate versus source activate

    conda activate will only work if you have run conda init in the past. Running conda init will make loading environments easier as you will always have a conda environment loaded.

    However, this can also cause problems, as programs in the main environment will be constantly loaded and might be used even when they're not supposed to be. A common example is not having pip installed in a conda environment, which results in pip from the main environment being used instead.

    You can always try:

    $ source activate python310-env
    
  3. Open a Python console and check that you actually have the right version of each package:

    import numpy
    import matplotlib
    
    print('Numpy version: ', numpy.__version__)
    print('Matplotlib version: ', matplotlib.__version__)
    

    Or use the one-liner if you have access to a terminal like bash:

    $ python -c 'import numpy; print(numpy.__version__)'
    $ python -c 'import matplotlib;print(matplotlib.__version__)'
    
  4. Deactivate the environment:

    $ conda deactivate
    
  5. Check the NumPy and Matplotlib versions in the default environment to make sure they are different from those in python310-env.

There is no need to specify the conda environment when using deactivate. It deactivates the current environment.

In dependencies.rst:

Dependencies-3: Create a virtualenv (15 min, optional)

This is the same exercise as before but we use virtualenv rather than conda.

  1. Create a venv:

    $ python3 -m venv scicomp
    

    Here scicomp is the name of the virtual environment. It creates a new folder called scicomp.

  2. Activate it. To activate your newly created virtual environment, locate the script called activate and source it.

    • Linux/macOS: look in the bin folder inside the scicomp folder:

      $ source scicomp/bin/activate
      
    • Windows: most likely you can find it in the Scripts folder.

  3. Install Numpy 1.24.3 and Matplotlib 3.7.2 into the virtual environment:

    $ pip install numpy==1.24.3
    $ pip install matplotlib==3.7.2
    
  4. Deactivate it:

    $ deactivate
    

In dependencies.rst:

Dependencies-4: Freeze an environment (15 min)

  • Create the file environment.yml or requirements.txt

  • Create an environment based on these dependencies:
    • Conda: $ conda env create --file environment.yml

    • Virtual environment: First create and activate, then $ pip install -r requirements.txt

  • Freeze the environment:
    • Conda: $ conda env export > environment.yml

    • Virtual environment: $ pip freeze > requirements.txt

  • Have a look at the generated (“frozen”) file.

Binder

In binder.rst:

Binder-1: Discuss better strategies than only code sharing (10 min)

Lea is a PhD student in computational biology and, after 2 years of intensive work, she is finally ready to publish her first paper. The code she has used for analyzing her data is available on GitHub, but her supervisor, who is an advocate of open science, told her that sharing code is not sufficient.

Why is it possibly not enough to share “just” your code? What problems can you anticipate 2-5 years from now?

We form small groups (4-5 persons) and discuss. If the workshop is online, each group joins a breakout room. If joining a group is not possible or practical, we use the shared document to discuss this collaboratively.

Each group writes a summary (bullet points) of the discussion in the workshop's shared document (the link will be provided by your instructors).

In binder.rst:

Binder-2: Exercise/demo: Make your notebooks reproducible by anyone (15 min)

The instructor demonstrates this. This exercise (and all following ones) requires git/GitHub knowledge and accounts, which were not a prerequisite of this course. Thus, this is a demo (and it might even be too fast to type along). Watch the video if you are reading this later on:

  • Create a GitHub repository

  • Upload the notebook file

  • Look at the statically rendered version of the notebook on GitHub

  • Create a requirements.txt file which contains:

    pandas==1.2.3
    matplotlib==3.4.2
    
  • Commit and push also this file to your notebook repository.

  • Visit https://mybinder.org and copy-paste the text under “Copy the text below …” into your README.md:

    [Screenshot of the mybinder.org page]
  • Check that your notebook repository now has a “launch binder” badge in your README.md file on GitHub.

  • Try clicking the button and see how your repository is launched on Binder (can take a minute or two). Your notebooks can now be explored and executed in the cloud.

  • Enjoy being fully reproducible!

Parallel programming

In parallel.rst:

Parallel-1, multiprocessing

Here you find some code which calculates pi using a stochastic algorithm. You don’t really need to worry about how the algorithm works: it generates random points in a 1x1 square and counts how many fall inside a circle. Copy it into a Jupyter notebook and use the %%timeit cell magic on the computation part (the one highlighted line after %%timeit below):

import random

def sample(n):
    """Make n trials of points in the square.  Return (n, number_in_circle)

    This is our basic function.  By design, it returns everything it
    needs to compute the final answer: both n (even though it is an input
    argument) and n_inside_circle.  To compute our final answer, all we
    have to do is sum up the n:s and the n_inside_circle:s and do our
    computation"""
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n, n_inside_circle

%%timeit
n, n_inside_circle = sample(10**6)

pi = 4.0 * (n_inside_circle / n)
pi

Using the multiprocessing.pool.Pool code from the lesson, run the sample function 10 times, each with 10**5 samples only. Combine the results and time the calculation. What is the difference in time taken?

NOTE: If you’re working in an interactive environment and this doesn’t work with the multiprocessing module, install and use the multiprocess module instead!

(optional, advanced) Do the same but with multiprocessing.pool.ThreadPool instead. This works identically to Pool, but uses threads instead of different processes. Compare the time taken.
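
A sketch of the multiprocessing part (swap in the multiprocess module if plain multiprocessing fails in your notebook, as noted above); wrap the whole cell in %%timeit to time it:

    import multiprocessing.pool

    # run sample 10 times with 10**5 points each, in parallel worker processes
    with multiprocessing.pool.Pool() as pool:
        results = pool.map(sample, [10**5] * 10)

    # combine: sum up the n:s and the n_inside_circle:s
    n_sum = sum(n for n, _ in results)
    n_inside_sum = sum(n_inside for _, n_inside in results)
    pi_estimate = 4.0 * n_inside_sum / n_sum
    print(pi_estimate)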

In parallel.rst:

(advanced) Parallel-2 Running on a cluster

How does the pool know how many CPUs to take? What happens if you run on a computer cluster and request only part of the CPUs on a node?

In parallel.rst:

Parallel-3, MPI

We can do this as an exercise or as a demo. Note that this example requires mpi4py and an MPI installation such as OpenMPI.

  • Try to run this example on one core: $ python example.py.

  • Then compare the output with a run on multiple cores (in this case 2): $ mpiexec -n 2 python example.py.

  • Can you guess what the comm.gather function does by looking at the print-outs right before and after it?

  • Why do we have the if-statement if rank == 0 at the end?

  • Why did we use _, n_inside_circle = sample(n_task) and not n, n_inside_circle = sample(n_task)?

In parallel.rst:

Dask-Examples (optional)

Dask examples illustrate the usage of dask and can be run interactively through mybinder. Start an interactive session on mybinder and test/run a few dask examples.

Packaging

In packaging.rst:

Packaging-1

To test a local pip install:

  • Create a new folder outside of our example project

  • Create a new virtual environment (Dependency management)

  • Install the example package from the project folder into the new environment: $ pip install /path/to/project-folder/

  • Test the local installation:

from calculator import add, subtract, integral

print(add(2, 3))
print(subtract(2, 3))
print(integral(lambda x: x * x, 0.0, 1.0))

Web APIs with Python

In web-apis.ipynb:

Exercise WebAPIs-1: Request different activity suggestions from the Bored API

Go to the documentation page of the Bored API. The Bored API is an open API which can be used to randomly generate activity suggestions.

Let’s examine the first sample query on the page, http://www.boredapi.com/api/activity/, with a sample JSON response:

{
    "activity": "Learn Express.js",
    "accessibility": 0.25,
    "type": "education",
    "participants": 1,
    "price": 0.1,
    "link": "https://expressjs.com/",
    "key": "3943506"
} 

Let’s replicate the query and see if we can get another random suggestion.
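
One way to replicate the query from Python is with the requests library, as a sketch:

    import requests

    response = requests.get("http://www.boredapi.com/api/activity/")
    print(response.status_code)   # 200 means the request succeeded
    print(response.json())        # a different random activity each time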

In web-apis.ipynb:

Exercise WebAPIs-2: Examine request and response headers

Request headers are similar to request parameters but usually define meta information regarding, e.g., content encoding (gzip, utf-8) or user identification (user-agent/user ID/etc., password/access token/etc.).

Let’s first make a request.
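
A sketch of inspecting the headers on both sides of a request (the custom User-Agent value is just an example):

    import requests

    response = requests.get(
        "http://www.boredapi.com/api/activity/",
        headers={"User-Agent": "python-for-scicomp-exercise"},
    )

    print(response.request.headers)   # headers we sent
    print(response.headers)           # headers the server sent back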
