|
|
|
|
|
by justinsaccount
3667 days ago
|
|
> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. I find your statements confusing. The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is. My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable" |
|
I was pointing out that if you are doing the analytics at AWS or similar on-demand scenario (a common pattern I see people trying/using and eventually rejecting), you have a serial data motion step to distribute data to your data lake before processing. Then you extract your results, decommission all those servers. Rinse, and repeat.
The point is, that for ephermal compute/storage scenarios, you have a set of poorly architected resources tied together in a way that pretty much guarantees you have a large (dominant) serial step before anything goes in parallel.
What we advocate (and have been doing so for more than a decade), are far more capable building blocks of storage+compute. So if you are going to build a system to process a large amount of data, instead of buying 1000 nodes and managing them, buy 20-50 far more capable nodes at a small fraction of the price, and get the same/better performance.
It also doesn't take 0 seconds. There is a distribution/management overhead (very much non-zero), as well as data motion overhead for results (depending upon the nature of the query). When you subdivide the problems finely enough, the query management overhead actually dominates the computation (which was another aspect of the point I made, but it wasn't explicit on my part).
So we are fine with building hadoop/spark/kdb+ systems ... we just want them built out of units that can move 10-20GB/s between storage and processing. Which lets you hit 50-100s per TB of processed data. Which gives you a fighting chance at dealing with PB scale problems in reasonable time frames, which a number of our users have.