Advanced NumPy

Questions

How can NumPy be so fast?
Why are some things fast and some things slow?
How can I control whether NumPy makes a copy or operates in-place?

Objectives

Understand why NumPy has so many specialized functions for specific operations
Understand the underlying machinery of the Numpy ndarray object
Understand when and why NumPy makes a copy of the data rather than a view

This is intended as a follow-up to the basic NumPy lesson. The intended audience for this advanced lesson is those who have used NumPy before and now want to learn how to get the most out of this amazing package.

Python, being an interpreted programming language, is quite slow. Manipulating large amounts of numbers using Python’s build-in lists would be impractically slow for any serious data analysis. Yet, the NumPy package can be really fast. How does it do that? We will dive into how NumPy works behind the scenes and use this knowledge to our advantage. This lesson also serves as an introduction to reading the definitive work on this topic: Guide to NumPy by Travis E. Oliphant, its initial creator.

NumPy can be really fast

Python, being an interpreted programming language, is quite slow. Manipulating large amounts of numbers using Python’s build-in lists would be impractically slow for any serious data analysis. Yet, the numpy package can be really fast.

How fast can NumPy be? Let’s race NumPy against C. The contest will be to sum together 100 000 000 random numbers. We will give the C version below, you get to write the NumPy version:

#include <stdlib.h>
#include <stdio.h>
#define N_ELEMENTS 100000000
int main(int argc, char** argv) {
    double* a = (double*) malloc(sizeof(double) * N_ELEMENTS);
    int i;
    for(i=0; i<N_ELEMENTS; ++i) {
        a[i] = (double) rand() / RAND_MAX;
    }
    double sum = 0;
    for(i=0; i<N_ELEMENTS; ++i) {
        sum += a[i];
    }
    printf("%f", sum);
    return 0;
}

Exercise 1

Exercises: Numpy-Advanced-1

Write a Python script that uses NumPy to generate 100 million (100000000) random numbers and add them all together. Time how long it takes to execute. Can you beat the C version?

If you are having trouble with this, we recommend completing the basic NumPy lesson before continuing with this advanced lesson. If you are taking a live course - don’t worry, watch and learn and explore some during the exercises!

Solutions: Numpy-Advanced-1

The script can be implemented like this:

import numpy as np
print(np.random.rand(100_000_000).sum())

The libraries behind the curtain: MKL and BLAS

NumPy is fast because it outsources most of its heavy lifting to heavily optimized math libraries, such as Intel’s Math Kernel Library (MKL), which are in turn derived from a Fortran library called Basic Linear Algebra Subprograms (BLAS). BLAS for Fortran was published in 1979 and is a collection of algorithms for common mathematical operations that are performed on arrays of numbers. Algorithms such as matrix multiplication, computing the vector length, etc. The API of the BLAS library was later standardized, and today there are many modern implementations available. These libraries represent over 40 years of optimizing efforts and make use of specialized CPU instructions for manipulating arrays. In other words, they are fast.

One of the functions inside the BLAS library is a function to compute the “norm” of a vector, which is the same as computing its length, using the Pythagorean theorem: \(\sqrt(a[0]^2 + a[1]^2 + \ldots)\).

Let’s race the BLAS function versus a naive “manual” version of computing the vector norm. We start by creating a decently long vector filled with random numbers:

import numpy as np
rng = np.random.default_rng(seed=0)
a = rng.random(100_000_000)

We now implement the Pythagorean theorem using basic NumPy functionality and use %%timeit to record how long it takes to execute:

%%timeit
l = np.sqrt(np.sum(a ** 2))
print(l)

And here is the version using the specialized BLAS function norm():

%%timeit
l = np.linalg.norm(a)
print(l)

NumPy tries to avoid copying data

Understanding the kind of operations that are expensive (take a long time) and which ones are cheap can be surprisingly hard when it comes to NumPy. A big part of data processing speed is memory management. Copying big arrays takes time, so the less of that we do, the faster our code runs. The rules of when NumPy copies data are not trivial and it is worth your while to take a closer look at them. This involves developing an understanding of how NumPy’s numpy.ndarray datastructure works behind the scenes.

An example: matrix transpose

Transposing a matrix means that all rows become columns and all columns become rows. All off-diagonal values change places. Let’s see how long NumPy’s transpose function takes, by transposing a huge (10 000 ✕ 20 000) rand() matrix:

import numpy as np
a = np.random.rand(10_000, 20_000)
print(f'Matrix `a` takes up {a.nbytes / 10**6} MB')

Let’s time the transpose() method:

%%timeit
b = a.transpose()

It takes mere nanoseconds to transpose 1600 MB of data! How?

The ndarray exposed

The first thing you need to know about numpy.ndarray is that the memory backing it up is always a flat 1D array. For example, a 2D matrix is stored with all the rows concatenated as a single long vector.

NumPy is faking the second dimension behind the scenes! When we request the element at say, [2, 3], NumPy converts this to the correct index in the long 1D array [11].

Converting [2, 3] → [11] is called “raveling”

The reverse, converting [11] → [2, 3] is called “unraveling”

The implications of this are many, so take let’s take some time to understand it properly by writing our own ravel() function.

Exercise 2

Exercises: Numpy-Advanced-2

Write a function called ravel() that takes the row and column of an element in a 2D matrix and produces the appropriate index in an 1D array, where all the rows are concatenated. See the image above to remind yourself how each row of the 2D matrix ends up in the 1D array.

The function takes these inputs:

row The row of the requested element in the matrix as integer index.

col The column of the requested element in the matrix as integer index.

n_rows The total number of rows of the matrix.

n_cols The total number of columns of the matrix.

Here are some examples of input and desired output:

ravel(2, 3, n_rows=4, n_cols=4) → 11

ravel(2, 3, n_rows=4, n_cols=8) → 19

ravel(0, 0, n_rows=1, n_cols=1) → 0

ravel(3, 3, n_rows=4, n_cols=4) → 15

ravel(3_465, 18_923, n_rows=10_000, n_cols=20_000) → 69_318_923

Solutions: Numpy-Advanced-2

The function can be implemented like this:

def ravel(row, col, n_rows, n_cols):
  return row * n_cols + col

Strides

As seen in the exercise, to get to the next row, we have to skip over n_cols indices. To get to the next column, we can just add 1. To generalize this code to work with an arbitrary number of dimensions, NumPy has the concept of “strides”:

np.zeros((4, 8)).strides           # (64, 8)
np.zeros((4, 5, 6, 7, 8)).strides  # (13440, 2688, 448, 64, 8)

The strides attribute contains for each dimension, the number of bytes (not array indexes) we have to skip over to get to the next element along that dimension. For example, the result above tells us that to get to the next row in a 4 ✕ 8 matrix, we have to skip ahead 64 bytes. 64? Yes! We have created a matrix consisting of double-precision floating point numbers. Each one of those bad boys takes up 8 bytes, so all the indices are multiplied by 8 to get to the proper byte in the memory array. To move to the next column in the matrix, we skip ahead 8 bytes.

So now we know the mystery behind the speed of transpose(). NumPy can avoid copying any data by just modifying the strides of the array:

import numpy as np

a = np.random.rand(10_000, 20_000)
b = a.transpose()

print(a.strides)  # (160000, 8)
print(b.strides)  # (8, 160000)

Another example: reshaping

Modifying the shape of an array through numpy.reshape() is also accomplished without any copying of data by modifying the strides:

a = np.random.rand(20_000, 10_000)
print(f'{a.strides=}')  # (80000, 8)
b = a.reshape(40_000, 5_000)
print(f'{b.strides=}')  # (40000, 8)
c = a.reshape(20_000, 5_000, 2)
print(f'{c.strides=}')  # (80000, 16, 8)

Exercises 3

Exercises: Numpy-Advanced-3

A little known feature of NumPy is the numpy.stride_tricks module that allows you to modify the strides attribute directly. Playing around with this is very educational.

Create your own transpose() function that will transpose a 2D matrix by reversing its shape and strides attributes using numpy.lib.stride_tricks.as_strided().
Create a (5 ✕ 100 000 000 000) array containing on the first row all 1’s, the second row all 2’s, and so on. Start with an 1D array a = np.array([1., 2., 3., 4., 5.]) and modify its shape and strides attributes using numpy.lib.stride_tricks.as_strided() to obtain the desired 2D matrix:
```
array([[1., 1., 1., ..., 1., 1., 1.],
       [2., 2., 2., ..., 2., 2., 2.],
       [3., 3., 3., ..., 3., 3., 3.],
       [4., 4., 4., ..., 4., 4., 4.],
       [5., 5., 5., ..., 5., 5., 5.]])
```

Solutions: Numpy-Advanced-3

The transpose() function can be implemented like this:

from numpy.lib.stride_tricks import as_strided
def transpose(a):
  return as_strided(a, shape=a.shape[::-1], strides=a.strides[::-1])

# Testing the function on a small matrix
a = np.array([[1, 2, 3],
              [4, 5, 6]])
print('Before transpose:')
print(a)
print('After transpose:')
print(transpose(a))

By setting one of the .strides to 0, we can repeat a value infinitely many times without using any additional memory:
from numpy.lib.stride_tricks import as_strided
a = np.array([1., 2., 3., 4., 5.])
as_strided(a, shape=(5, 100_000_000_000), strides=(8, 0))

A fast thing + a fast thing = a fast thing?

If numpy.transpose() is fast, and numpy.reshape() is fast, then doing them both must be fast too, right?:

# Create a large array
a = np.random.rand(10_000, 20_000)

Measuring the time it takes to first transpose and then reshape:

%%timeit -n 1 -r 1
a.T.reshape(40_000, 5_000)

In this case, the data actually had to be copied and it’s super slow (it takes seconds instead of nanoseconds). When the array is first created, it is laid out in memory row-by-row (see image above). The transpose left the data laid out in memory column-by-column. To see why the copying of data was inevitable, look at what happens to this smaller (2 ✕ 3) matrix after transposition and reshaping. You can verify for yourself there is no way to get the final array based on the first array and some clever setting of the strides:

a = np.array([[1, 2, 3], [4, 5, 6]])

print('Original array:')
print(a)

print('\nTransposed:')
print(a.T)

print('\nTransposed and then reshaped:')
print(a.T.reshape(2, 3))

Copy versus view

Whenever NumPy constructs a new array by modifying the strides instead of copying data, we way it created a “view”. This also happens when we select only a portion of an existing matrix. Whenever a view is created, the numpy.ndarray object will have a reference to the original array in its base attribute:

a = np.zeros((5, 5))
print(a.base)  # None
b = a[:2, :2]
print(b.base.shape)  # (5, 5)

Warning

When you create a large array and select only a portion of it, the large array will stay in memory if a view was created!

The new array b object has a pointer to the same memory buffer as the array it has been derived from:

print(a.__array_interface__['data'])
print(b.__array_interface__['data'])

Views are created by virtue of modifying the value of the shape attribute and, if necessary, apply an offset to the pointer into the memory buffer so it no longer points to the start of the buffer, but somewhere in the middle:

b = a[1:3, 1:3]   # This view does not start at the beginning
offset = b.__array_interface__['data'][0] - a.__array_interface__['data'][0]
print('Offset:', offset, 'bytes')  # Offset: 48 bytes

Since the base array and its derived view share the same memory, any changes to the data in a view also affects the data in the base array:

b[0, 0] = 1.
print(a)  # Original matrix was modified

Whenever you index an array, NumPy will attempt to create a view. Whether or not that succeeds depends on the memory layout of the array and what kind of indexing operation was done. If no view can be created, NumPy will create a new array and copy over the selected data:

c = a[[0, 2]]  # Select rows 0 and 2
print(c.base)  # None. So not a view.