What code has to teach us #1: the impact of implicit behavior

“The master has failed more times than the beginner has even tried”
– Stephen McCranie

As Research Software Engineers (RSEs), we read and write a lot of code. In this series of blog posts, we are going to share some snippets that taught us important lessons, and thereby impart that wisdom unto you. These snippets are taken from actual research code, responsible for producing results that end up in peer-reviewed scientific articles. That is to say, results that we should be confident are correct. However, problems have a way of cropping up in the most unexpected places, and when they do, there is a chance to learn from them.

The impact of implicit behavior

I was in the metro zooming through Lauttasaari when I received an email from my professor that made my heart skip a beat. We had just submitted a paper to Nature Communications and were all still a little giddy about finally sending off the project we had been working on for three years. She and the first author had been chatting about the cool methods we had been using for the project, and a question arose: were we 100% certain that we “removed copies of the selected stimuli from the train set”? If we hadn’t, we would have to quickly pull back our submission, but surely we had, right? I thought we did. At least, I distinctly remember writing the code to do it. Just to be on the safe side, I decided to double-check the code.

Below is the analysis script in question. It reads some data, performs some preprocessing, feeds it into a machine learning algorithm called zero_shot_decoding, and stores the output. I present it here to you in full, because there are many subtleties working together that make this situation so scary. The question I pose to you, dear reader, is this: were the highlighted lines (118–120) executed, or did we have to pull our submission?

  1 import numpy as np
  2 from scipy.io import loadmat, savemat
  3 from scipy.stats import zscore
  4 from zero_shot_decoding import zero_shot_decoding
  5 import argparse, os
  6
  7 # Default location of the norm data (see also the --norms command line parameter)
  8 norm_file = '../data/corpusvectors_ginter_lemma.mat'
  9
 10 # Handle command line arguments
 11 parser = argparse.ArgumentParser(description='Run zero-shot learning on a single subject.')
 12 parser.add_argument('input_file', type=str,
 13                     help='The file that contains the subject data; should be a .mat file.')
 14 parser.add_argument('-s', '--subject-id', metavar='Subject ID', type=str, required=True,
 15                     help='The subject-id (as string). This number is recorded in the output .mat file.')
 16 parser.add_argument('--norms', metavar='filename', type=str, default=norm_file,
 17                     help='The file that contains the norm data. Defaults to %s.' % norm_file)
 18 parser.add_argument('-o', '--output', metavar='filename', type=str, default='results.mat',
 19                     help='The file to write the results to; should end in .mat. Defaults to results.mat')
 20 parser.add_argument('-v', '--verbose', action='store_true',
 21                     help='Whether to show a progress bar')
 22 parser.add_argument('-b', '--break-after', metavar='N', type=int, default=-1,
 23                     help='Break after N iterations (useful for testing)')
 24 parser.add_argument('-n', '--n_voxels', metavar='N voxels', type=int, default=500,
 25                     help='Number of voxels. Used only for results file name.')
 26 parser.add_argument('-d', '--distance-metric', type=str, default='cosine',
 27                     help=("The distance metric to use. Any distance implemented in SciPy's "
 28                           "spatial.distance module is supported. See the docstring of "
 29                           "scipy.spatial.distance.pdist for the exhaustive list of possible "
 30                           "metrics. Here are some of the more useful ones: "
 31                           "'euclidean' - Euclidean distance "
 32                           "'sqeuclidean' - Squared euclidean distance "
 33                           "'correlation' - Pearson correlation "
 34                           "'cosine' - Cosine similarity (the default)"))
 35 args = parser.parse_args()
 36
 37 verbose = args.verbose
 38 if args.break_after > 0:
 39     break_after = args.break_after
 40 else:
 41     break_after = None
 42
 43 print('Subject:', args.subject_id)
 44 print('Input:', args.input_file)
 45 print('Output:', args.output)
 46 print('Norms:', args.norms)
 47 print('Distance metric:', args.distance_metric)
 48
 49
 50 m = loadmat(args.input_file)
 51 if 'brainVecsReps' in m:
 52     # File without stability selection enabled
 53     print('Stability selection DISABLED')
 54     X = np.array([m['brainVecsReps'][0][i] for i in range(m['brainVecsReps'][0].shape[0])])
 55     n_repetitions, n_stimuli, n_voxels = X.shape
 56     voxel_ids = []
 57
 58     # Drop all voxels that contain NaN's for any items
 59     non_nan_mask = ~np.any(np.any(np.isnan(X), axis=1), axis=0)
 60     non_nan_indices = np.flatnonzero(non_nan_mask)
 61     X = X[:, :, non_nan_mask]
 62
 63     # Normalize betas across items
 64     X = zscore(X, axis=1, ddof=1)
 65
 66     # Average over the repetitions
 67     X = X.mean(axis=0)
 68
 69     X_perm = None
 70     splits = None
 71
 72 elif 'mask_voxels' in m:
 73     # File without stability selection enabled
 74     print('Stability selection DISABLED')
 75     X = m['mask_voxels']
 76     voxel_ids = m['voxel_ids']
 77     n_stimuli, n_voxels = X.shape
 78     X_perm = None
 79     splits = None
 80
 81 elif 'top_voxels_perm' in m:
 82     # File with stability selection enabled
 83     print('Stability selection ENABLED')
 84     X_perm = m['top_voxels_perm']
 85     X = m['top_voxels_all']
 86     voxel_ids = m['top_voxel_ids']
 87     n_stimuli, n_voxels, _ = X_perm.shape
 88
 89     assert os.path.isfile('leave2out_index.npy')
 90     splits = np.load('leave2out_index.npy')
 91
 92 elif 'brainVecs' in m:
 93     # File with single-trial data
 94     print('Stability selection DISABLED, single-trial data')
 95     X = m['brainVecs']
 96     voxel_ids = m['voxindex']
 97     n_stimuli, n_voxels = X.shape
 98     X_perm = None
 99
100     def generate_splits(n_stimuli, block_size=60):
101         """Generate train-set, test-set splits.
102
103         To save computation time, we don't do the full 360*359/2 iterations.
104         Instead we will do the leave-2-out scheme block-wise and use the rest
105         of the data for training.
106         """
107         assert n_stimuli % block_size == 0
108         n_blocks = n_stimuli // block_size
109         for x in range(n_stimuli):
110             for y in range(x + 1, n_stimuli):
111                 # Don't make the model distinguish between duplicate stimuli
112                 if x % block_size == y % block_size:
113                     continue
114
115                 test_set = [x, y]
116                 train_set = np.setdiff1d(np.arange(n_stimuli), test_set)
117
118                 # Remove copies of the selected stimuli from the train set
119                 train_set = np.setdiff1d(train_set, [i * block_size + (x % block_size) for i in range(n_blocks)])
120                 train_set = np.setdiff1d(train_set, [i * block_size + (y % block_size) for i in range(n_blocks)])
121
122                 yield train_set, test_set
123
124     splits = generate_splits(n_stimuli)
125
126 else:
127     raise RuntimeError('Could not find any suitable data in the supplied input file.')
128
129 # Load the norm data
130 m = loadmat(args.norms)
131 y = m['newVectors']
132
133 if not np.isfinite(y).all():
134     raise RuntimeError('The norm data contains NaNs or Infs.')
135 if not np.isfinite(X).all():
136     raise RuntimeError('The brain data contains NaNs or Infs.')
137
138 pairwise_accuracies, model, target_scores, predicted_y, patterns = zero_shot_decoding(
139     X, y, X_perm, verbose=verbose, break_after=break_after, metric=args.distance_metric, cv_splits=splits
140 )
141
142 savemat(args.output, {
143     'pairwise_accuracies': pairwise_accuracies,
144     'weights': model.coef_,
145     'feat_scores': target_scores,
146     'subject': args.subject_id,
147     'inputfile': args.input_file,
148     'alphas': model.alpha_,
149     'voxel_ids': voxel_ids,
150     'predicted_y': predicted_y,
151     'patterns': patterns,
152 })

Lessons this code has to teach us

The first thing that went through my head, as it probably went through yours, was: this code is so long and complicated that answering this seemingly simple question is going to take some time. And I won’t blame you for giving up right then and there. Hunched over my laptop while the metro passed through Ruoholahti, I tried to trace the logic of the script.

First problem: much of the behavior of the script is dictated by the command line arguments. Luckily, their values are saved in the output file, so I could check that they were correct.

Note

Lesson: always err on the side of caution when deciding whether it is worth storing something in the result file.
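
For example, the script above saves args.subject_id and args.input_file into the results file. A version that errs even further on the side of caution could record every parsed argument automatically. Below is a minimal sketch of the idea (the placeholder results and the arg_ prefix are mine, not part of the original script):

    import argparse
    from scipy.io import savemat

    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--subject-id', type=str, required=True)
    parser.add_argument('-o', '--output', type=str, default='results.mat')
    args = parser.parse_args()

    results = {'pairwise_accuracies': [0.5, 0.7]}  # placeholder for the real results
    # vars() turns the argparse Namespace into a plain dict, so every
    # argument gets stored next to the results it helped produce.
    results.update({'arg_' + name: value for name, value in vars(args).items()})
    savemat(args.output, results)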

That brings us to the big if-statement. Did the correct branch execute? Well, that depends on what was in the m dictionary, which translates to which variables were defined in the MATLAB file used as input to the script. If we had used the wrong variable name when creating the input file, e.g. brainVecsReps instead of brainVecs, the wrong branch would have executed and the script would have happily computed the wrong thing. And we would never have known. The same goes for using the wrong input file, or the wrong version of the input file: the wrong branch would have executed without any indication that something was amiss. So many opportunities for small mistakes to lead to a big error.

Note

Lesson: have the user be explicit about what they want to do, so the script can check the user’s intent against the inputs and raise a nice big error if they screwed up. In this case, there should really have been either a command line parameter determining which branch to execute, or, even better, four separate scripts.
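
A minimal sketch of that first option could look like the following. The --mode flag, its value names, and the mapping to MATLAB variables are my own illustration, not part of the original script; the point is that the user’s stated intent gets checked against the actual contents of the input file:

    import argparse
    from scipy.io import loadmat

    parser = argparse.ArgumentParser()
    parser.add_argument('input_file')
    parser.add_argument('--mode', required=True,
                        choices=['repetitions', 'masked', 'stability', 'single-trial'],
                        help='The kind of input file to expect and analysis to run.')
    args = parser.parse_args()

    # The MATLAB variable each mode requires to be present in the input file.
    expected_var = {
        'repetitions': 'brainVecsReps',
        'masked': 'mask_voxels',
        'stability': 'top_voxels_perm',
        'single-trial': 'brainVecs',
    }[args.mode]

    m = loadmat(args.input_file)
    if expected_var not in m:
        raise RuntimeError('Input file %s does not contain %s. Are you sure '
                           'this is a %s input file?'
                           % (args.input_file, expected_var, args.mode))

With this in place, running the script on the wrong input file, or the wrong version of it, produces an immediate, unmissable error instead of a silently wrong result.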

In the end, I searched the logfile for the line Stability selection DISABLED, single-trial data, which, thankfully, was there, so the correct branch did execute.

Note

Lesson: be liberal with print-statements (or other logging directives) in your scripts; cherish the resulting logfiles.
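
The humble print-statements in the script above are what saved the day here. Python’s standard logging module can take this one step further and write the same breadcrumbs to a timestamped logfile as well. A minimal sketch (the logfile name is arbitrary):

    import logging

    # Send every message both to the console and to a permanent logfile,
    # with a timestamp in front of each line.
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(message)s',
        handlers=[logging.StreamHandler(),
                  logging.FileHandler('analysis.log')],
    )
    logging.info('Stability selection DISABLED, single-trial data')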

I breathed a sigh of relief as the metro pulled into the central railway station.

This if-statement is a work of insanity. What was I thinking, determining what the script should do based on a mostly random naming scheme of some variables in a MATLAB file? I got lucky that time. But from that moment on, I would heed this lesson:

Note

Explicit is better than implicit.
– The Zen of Python, by Tim Peters