Hacker News new | ask | show | jobs
by blhack 3410 days ago
Maybe I'm misunderstanding the problem, but why can't you scale out horizontally?

If the problem is that queries or sets of data might have to jump nodes, couldn't the data be designed in such a way where an assumption is made about what sorts of queries will happen at write?

Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.

1 comments

My lab works with multi-terabyte datasets on a regular basis. We have big machines to do machine learning on, but when they're not in used, I can tell you that it's way easier to provision and write a single or multi-threaded script that just loads everything into memory rather than deal with networking and partitioned data.

Imagine the difference between setting up a spark cluster and writing a for loop. For instance, for reasons someone created a 1TB hdf5 file. Luckily, we had a computer with 500GB+ of ram and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one time batch job and did other useful things in the mean time.