by blhack
3410 days ago
Maybe I'm misunderstanding the problem, but why can't you scale out horizontally? If the problem is that queries or sets of data might have to jump nodes, couldn't the data be laid out at write time based on an assumption about what sorts of queries will happen? Optimize so that node spanning is rare, eat the cost when it does happen, and let those 1/n queries disappear into the average.
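To make that concrete, here is a rough sketch of what I mean, not a real system. Everything in it is made up for illustration: the `Cluster` class, the `node_for` helper, the four-node count, and the assumption that most queries ask for a single user's records. Each "node" is just an in-memory dict.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a partition key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Cluster:
    """Toy cluster: each node is a dict of user_id -> list of records."""

    def __init__(self, num_nodes: int = NUM_NODES):
        self.num_nodes = num_nodes
        self.nodes = [defaultdict(list) for _ in range(num_nodes)]

    def write(self, user_id: str, record):
        # Write-time assumption: most queries want one user's records,
        # so all of a user's records are placed on the same node.
        self.nodes[node_for(user_id, self.num_nodes)][user_id].append(record)

    def query_user(self, user_id: str):
        # Common case: touches exactly one node.
        node = self.nodes[node_for(user_id, self.num_nodes)]
        return list(node.get(user_id, []))

    def query_all(self, predicate):
        # Rare case: spans every node, and we just eat the cost.
        return [
            record
            for node in self.nodes
            for records in node.values()
            for record in records
            if predicate(record)
        ]
```

With this layout, `query_user("alice")` only ever reads from one node, while `query_all(...)` has to visit all of them. The bet is that the second kind of query is rare enough to vanish into the average.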
Imagine the difference between setting up a Spark cluster and writing a for loop. For instance, for whatever reason, someone created a 1 TB HDF5 file. Luckily, we had a computer with 500 GB+ of RAM and lots of swap, so instead of having to hack the file apart and figure out how to chunk or parallelize it, we loaded it into memory for a one-time batch job and did other useful things in the meantime.