What code has to teach us #1: the impact of implicit behavior
“The master has failed more times than the beginner has even tried.” – Stephen McCranie
As Research Software Engineers (RSEs), we read and write a lot of code. In this series of blog posts, we are going to share some snippets that taught us important lessons, and thereby impart that wisdom unto you. These snippets are taken from actual research code, responsible for producing results that end up in peer-reviewed scientific articles. That is to say, results that we need to be confident are correct. However, problems have a way of cropping up in the most unexpected places, and when they do, there is a chance to learn from them.
The impact of implicit behavior
I was in the metro zooming through Lauttasaari when I received an email from my professor that made my heart skip a beat. We had just submitted a paper to Nature Communications and were all still a little giddy about finally sending off the project we had been working on for three years. She and the first author had been chatting about the cool methods we had used for the project, and a question arose: were we 100% certain that we “removed copies of the selected stimuli from the train set”? If we hadn’t, we would have to quickly pull back our submission. But surely we had, right? I thought we did. At least, I distinctly remembered writing the code to do it. Just to be on the safe side, I decided to double-check the code.
Below is the analysis script in question. It reads some data, performs some preprocessing, feeds it into a machine learning algorithm called zero_shot_decoding, and stores the output. I present it here to you in full, because there are many subtleties working together that make this situation so scary.
The question I pose to you, dear reader, is this: were the crucial lines (118–120) executed, or did we have to pull our submission?
1 import numpy as np
2 from scipy.io import loadmat, savemat
3 from scipy.stats import zscore
4 from zero_shot_decoding import zero_shot_decoding
5 #print('Code version:'+ subprocess.check_output(['git', 'rev-parse', 'HEAD']))
6
7 # Default location of the norm data (see also the --norms command line parameter)
8 norm_file = '../data/corpusvectors_ginter_lemma.mat'
9
10 # Handle command line arguments
11 parser = argparse.ArgumentParser(description='Run zero-shot learning on a single subject.')
12 parser.add_argument('input_file', type=str,
13                     help='The file that contains the subject data; should be a .mat file.')
14 parser.add_argument('-s', '--subject-id', metavar='Subject ID', type=str, required=True,
15                     help='The subject-id (as string). This number is recorded in the output .mat file.')
16 parser.add_argument('--norms', metavar='filename', type=str, default=norm_file,
17                     help='The file that contains the norm data. Defaults to %s.' % norm_file)
18 parser.add_argument('-o', '--output', metavar='filename', type=str, default='results.mat',
19                     help='The file to write the results to; should end in .mat. Defaults to results.mat')
20 parser.add_argument('-v', '--verbose', action='store_true',
21                     help='Whether to show a progress bar')
22 parser.add_argument('-b', '--break-after', metavar='N', type=int, default=-1,
23                     help='Break after N iterations (useful for testing)')
24 parser.add_argument('-n', '--n_voxels', metavar='N voxels', type=int, default=500,
25                     help='Number of voxels. Used only for results file name.')
26 parser.add_argument('-d', '--distance-metric', type=str, default='cosine',
27                     help=("The distance metric to use. Any distance implemented in SciPy's "
28                           "spatial.distance module is supported. See the docstring of "
29                           "scipy.spatial.distance.pdict for the exhaustive list of possitble "
30                           "metrics. Here are some of the more useful ones: "
31                           "'euclidean' - Euclidean distance "
32                           "'sqeuclidean' - Squared euclidean distance "
33                           "'correlation' - Pearson correlation "
34                           "'cosine' - Cosine similarity (the default)"))
35 args = parser.parse_args()
36
37 verbose = args.verbose
38 if args.break_after > 0:
39     break_after = args.break_after
40 else:
41     break_after = None
42
43 print('Subject:', args.subject_id)
44 print('Input:', args.input_file)
45 print('Output:', args.output)
46 print('Norms:', args.norms)
47 print('Distance metric:', args.distance_metric)
48
49
50 m = loadmat(args.input_file)
51 if 'brainVecsReps' in m:
52     # File without stability selection enabled
53     print('Stability selection DISABLED')
54     X = np.array([m['brainVecsReps'][0][i] for i in range(m['brainVecsReps'][0].shape[0])])
55     n_repetitions, n_stimuli, n_voxels = X.shape
56     voxel_ids = []
57
58     # Drop all voxels that contain NaN's for any items
59     non_nan_mask = ~np.any(np.any(np.isnan(X), axis=1), axis=0)
60     non_nan_indices = np.flatnonzero(non_nan_mask)
61     X = X[:, :, non_nan_mask]
62
63     # Normalize betas across items
64     X = zscore(X, axis=1, ddof=1)
65
66     # Average over the repetitions
67     X = X.mean(axis=0)
68
69     X_perm = None
70     splits = None
71
72 elif 'mask_voxels' in m:
73     # File without stability selection enabled
74     print('Stability selection DISABLED')
75     X = m['mask_voxels']
76     voxel_ids = m['voxel_ids']
77     n_stimuli, n_voxels = X.shape
78     X_perm = None
79     splits = None
80
81 elif 'top_voxels_perm' in m:
82     # File with stability selection enabled
83     print('Stability selection ENABLED')
84     X_perm = m['top_voxels_perm']
85     X = m['top_voxels_all']
86     voxel_ids = m['top_voxel_ids']
87     n_stimuli, n_voxels, _ = X_perm.shape
88
89     assert os.path.isfile('leave2out_index.npy')
90     splits = np.load('leave2out_index.npy')
91
92 elif 'brainVecs' in m:
93     # File with single-trial data
94     print('Stability selection DISABLED, single-trial data')
95     X = m['brainVecs']
96     voxel_ids = m['voxindex']
97     n_stimuli, n_voxels = X.shape
98     X_perm = None
99
100     def generate_splits(n_stimuli, block_size=60):
101         """Generate train-set, test-set splits.
102
103         To save computation time, we don't do the full 360*359/2 iterations.
104         Instead we will do the leave-2-out scheme block-wise and use the rest
105         of the data for training.
106         """
107         assert n_stimuli % block_size == 0
108         n_blocks = n_stimuli // block_size
109         for x in range(n_stimuli):
110             for y in range(x + 1, n_stimuli):
111                 # Don't make the model distinguish between duplicate stimuli
112                 if x % block_size == y % block_size:
113                     continue
114
115                 test_set = [x, y]
116                 train_set = np.setdiff1d(np.arange(n_stimuli), test_set)
117
118                 # Remove copies of the selected stimuli from the train set
119                 train_set = np.setdiff1d(train_set, [i * block_size + (x % block_size) for i in range(n_blocks)])
120                 train_set = np.setdiff1d(train_set, [i * block_size + (y % block_size) for i in range(n_blocks)])
121
122                 yield train_set, test_set
123
124     splits = generate_splits(n_stimuli)
125
126 else:
127     raise RuntimeError('Could not find any suitable data in the supplied input file.')
128
129 # Load the norm data
130 m = loadmat(args.norms)
131 y = m['newVectors']
132
133 if not np.isfinite(y).all():
134     raise RuntimeError('The norm data contains NaNs or Infs.')
135 if not np.isfinite(X).all():
136     raise RuntimeError('The brain data contains NaNs or Infs.')
137
138 pairwise_accuracies, model, target_scores, predicted_y, patterns = zero_shot_decoding(
139     X, y, X_perm, verbose=verbose, break_after=break_after, metric=args.distance_metric, cv_splits=splits
140 )
141
142 savemat(args.output, {
143     'pairwise_accuracies': pairwise_accuracies,
144     'weights': model.coef_,
145     'feat_scores': target_scores,
146     'subject': args.subject_id,
147     'inputfile': args.input_file,
148     'alphas': model.alpha_,
149     'voxel_ids': voxel_ids,
150     'predicted_y': predicted_y,
151     'patterns': patterns,
152 })
Lessons this code has to teach us
The first thing that went through my head, as it probably went through yours, was: this code is so long and complicated that this seemingly simple question is going to take some time to figure out. And I won’t blame you for giving up right then and there. Hunched over my laptop while the metro passed through Ruoholahti, I tried to trace the logic of the script.
First problem: much of the behavior of the script is dictated by the command line arguments. Luckily, their values are saved in the output file, so I could check that they were correct.
Note
Lesson: always err on the side of caution when deciding whether it is worth storing something in the result file.
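To make this concrete, here is a minimal sketch of what that can look like: dump the parsed command line arguments into the results file along with everything else, so the output documents how it was produced. This is my own hypothetical illustration, not the original analysis script, and the argument names and placeholder result are invented for the example.

from argparse import ArgumentParser
from scipy.io import savemat

parser = ArgumentParser()
parser.add_argument('input_file')
parser.add_argument('-o', '--output', default='results.mat')
parser.add_argument('-d', '--distance-metric', default='cosine')
args = parser.parse_args()

# Placeholder for the actual analysis results.
results = {'pairwise_accuracies': [0.0]}

# vars(args) is a plain dict of every argument name and its value; store them
# all, so the output file records exactly how the script was invoked.
for name, value in vars(args).items():
    results['arg_' + name] = value

savemat(args.output, results)

A few extra kilobytes in the result file are a small price to pay for being able to answer "how exactly was this produced?" months later.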
That brings us to the big if-statement. Did the correct branch execute? Well, that depends on what was in the m dictionary, which translates to which variables were defined in the MATLAB file used as input to the script. If we had used the wrong variable name, e.g. brainVecsReps instead of brainVecs, when creating the input file, the wrong branch would have executed and the script would have happily computed the wrong thing. And we would never have known. If we had used the wrong input file, or the wrong version of the input file, the wrong branch would have executed without any indication that something was wrong. So many opportunities for small mistakes to lead to a big error.
Note
Lesson: have the user be explicit about what they want to do, so the script can check the user’s intent against the inputs and raise a nice big error if they screwed up. In this case, there should really have been either a command line parameter determining which branch to execute, or even better, this should have been four separate scripts.
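As a sketch of the first option, a hypothetical --mode parameter (my invention, not part of the original script) would have let the script verify the user’s stated intent against the contents of the input file, using the same variable names the real script looks for:

from argparse import ArgumentParser
from scipy.io import loadmat

parser = ArgumentParser()
parser.add_argument('input_file')
# The user must say what kind of input file this is.
parser.add_argument('--mode', required=True,
                    choices=['repetitions', 'masked', 'stability', 'single-trial'])
args = parser.parse_args()

# Check the stated intent against what is actually in the file.
expected_variable = {
    'repetitions': 'brainVecsReps',
    'masked': 'mask_voxels',
    'stability': 'top_voxels_perm',
    'single-trial': 'brainVecs',
}[args.mode]
m = loadmat(args.input_file)
if expected_variable not in m:
    raise RuntimeError(
        'Mode %r expects the variable %r in %s, but it is not there. '
        'Wrong input file?' % (args.mode, expected_variable, args.input_file)
    )

With a check like this, supplying the wrong file, or the wrong version of the right file, produces a loud error instead of a silently wrong result.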
In the end, I searched the logfile for the line Stability selection DISABLED, single-trial data, which, thankfully, was there, so the correct branch did execute.
Note
Lesson: be liberal with print statements (or other logging directives) in your scripts; cherish the resulting logfiles.
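Plain print statements already saved the day here, but Python’s standard logging module will write the same messages to a logfile with timestamps for free. A minimal sketch, with a hypothetical logfile name:

import logging

# Hypothetical logfile name; in practice, one logfile per analysis run works well.
logging.basicConfig(
    filename='analysis.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

logging.info('Stability selection DISABLED, single-trial data')
logging.info('Subject: %s', 'sub-01')  # example value; log whatever identifies the run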
I breathed a sigh of relief as the metro pulled into the central railway station.
This if-statement is a work of insanity. What was I thinking, determining what the script should do based on the mostly random naming scheme of some variables in a MATLAB file?
I got lucky that time.
But from that moment on, I would heed this lesson:
Note