Hacker News new | ask | show | jobs
by morcus 993 days ago
In college (for a time) I was a double major in CS and Physics. I found a job as a programmer at a Physics lab, which fit my interests very well. The previous person roughly showed me the ropes for just a few days before she left to go to grad school.

The PI started asking me to run some analyses on a raw dataset. Since I was so new at it, I often messed up and had to rerun the whole thing after looking at the output; this was painful because the entire script took a few hours to run.

I started poking around to see whether it could be optimized at all. the raw data was divided up into hundreds files from different runs, sensors, etc..., that were each processed independently in sequence, and the results were all combined together into a big array for the final result. Seems reasonable enough.

Except this code was all written by scientists, and the combination was done in the "naive" way - after each of data files was processed, a new array was created and the previous results were copied into the new array, as were the results from the current data file. This meant that for the iterations at the end, we roughly needed to have Memory = 2 * Size of final data, which eventually exceeded the amount of physical memory on the machine (and because there were so many data files, it was doing this allocation and copying dozens of times after it used all the RAM).

I updated this to pre-allocate the required size at the beginning for a very very easy 3-4 fold improvement in the overall runtime and felt rather proud of myself.

3 comments

Yeah, back in college I worked with a Biochemistry grad student on a group project that involved some coding (I was Computer Engineering). To iterate over a matrix, he used three nested loops with an if-statement to switch between rows and columns. Technically it worked but wildly inefficient, and he was proud of it...

To his credit once I (as nicely as possible) showed him how to do it with two nested for-loops he clearly felt stupid and conceded the point. He was otherwise a very smart guy and good to work with, but goes to show how we can take our training for granted. Even freshman-level stuff goes over the heads of PhDs, and I'm sure the same would be true if I were to drop into a biochem lab.

Similar story - a PI had written some code to from (row, column) indices of the upper triangle of a matrix (made somewhat tricky by excluding the main diagonal) to a linear index. He used a for loop to start from the beginning and count up for an O(n^2) algorithm - I was able to give him an O(1) constant time formula to do the same thing for a rather dramatic speedup.
I ended up needing this so often for graph processing, and for values which might be inexact if using floating point, that I saved the formula in a blog post. https://vladfeinberg.com/2020/03/07/subset-isomorphism.html

The formula can be "oblivious" to the final size of the matrix too, which is helpful if you're doing some sparse ML training on edges (e.g., GNNs).

During my masters thesis in a chemistry lab, I got a side task to look at a data analysis script and make it run faster. It was a "C/C++" code (i.e. procedural C-style code using C++ stdlib for convenience) that read a file line by line and then fed it to a slow processing function, then aggregated the results. It took over a day to run.

Without even looking at the processing function, which I considered some sciency science, I set up pthreads and mutexes on the result array and such to reap almost perfectly linear scaling. So far, so good.

Then I ran a profiler to see what was actually taking so long.

... Uh, why are you spending all this time copying strings back and forth?

Turns out they passed all strings by value. Sprinkling in a few const & here and there got a 1000-fold speedup or such. I felt pretty stupid for my multithreading antics after that.

Having continuity from your previous analysis is a feature though. You can load up one of your objects and a good library should have the exact parameters used to generate stored there.

Also, H5 data formats[0] have been a god-send for scientific computing, due to its ability to inherently make sense of how to store your data. You can have your previous results curried over into your new analysis without doubling your data.

0: https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Could you have achieved the same thing by just delete/free() the old arrays after copying them? I suck at manually allocating memory, have always worked in garbage collected languages where this wouldn't really be an issue.
No, it was actually in a (obscure scientific) garbage collected language. The syntax was roughly: `allOrbitFiles = allOrbitFiles + currentOrbitFiles`.

I believe what was roughly happening under the hood was: 1. Allocate an array `tmp` of size `length of allOrbitFiles` + `length of currentOrbitFiles`. 2. Copy data from `allOrbitFiles` over to `tmp`. 3. Copy data from `currentOrbitFiles` to `tmp` 4. Reassign `allOrbitFiles` to the new array `tmp`. 5. Garbage collect the old `allOrbitFiles`.

So the doubling of memory usage comes after Step 1. I would imagine (but don't know for sure) that this would actually occur in any garbage collected language I'm familiar with as well (Java, Python, Javascript).

Yes it would happen in JS/python/etc as well once (2 * allOrbitFiles) > Memory. Cool fix I never think about my arrays being larger than memory haha.