| HN Mirror

My lab works with multi-terabyte datasets on a regular basis. We have big machines to do machine learning on, but when they're not in used, I can tell you that it's way easier to provision and write a single or multi-threaded script that just loads everything into memory rather than deal with networking and partitioned data.

Imagine the difference between setting up a spark cluster and writing a for loop. For instance, for reasons someone created a 1TB hdf5 file. Luckily, we had a computer with 500GB+ of ram and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one time batch job and did other useful things in the mean time.