| In college (for a time) I was a double major in CS and Physics. I found a job as a programmer at a Physics lab, which fit my interests very well. The previous person roughly showed me the ropes for just a few days before she left to go to grad school. The PI started asking me to run some analyses on a raw dataset. Since I was so new at it, I often messed up and had to rerun the whole thing after looking at the output; this was painful because the entire script took a few hours to run. I started poking around to see whether it could be optimized at all. the raw data was divided up into hundreds files from different runs, sensors, etc..., that were each processed independently in sequence, and the results were all combined together into a big array for the final result. Seems reasonable enough. Except this code was all written by scientists, and the combination was done in the "naive" way - after each of data files was processed, a new array was created and the previous results were copied into the new array, as were the results from the current data file. This meant that for the iterations at the end, we roughly needed to have Memory = 2 * Size of final data, which eventually exceeded the amount of physical memory on the machine (and because there were so many data files, it was doing this allocation and copying dozens of times after it used all the RAM). I updated this to pre-allocate the required size at the beginning for a very very easy 3-4 fold improvement in the overall runtime and felt rather proud of myself. |
To his credit once I (as nicely as possible) showed him how to do it with two nested for-loops he clearly felt stupid and conceded the point. He was otherwise a very smart guy and good to work with, but goes to show how we can take our training for granted. Even freshman-level stuff goes over the heads of PhDs, and I'm sure the same would be true if I were to drop into a biochem lab.