| [old HPC guy here, you kids get off my lawn!] ... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment. This is a non-trivial problem. Ok, it is trivial, but its serial in most cases. Unless you start out with a completely distributed data set. And allocate permanent space on those 1000 nodes. So the data has to move once. And you can amortize that across all the runs. In reality, you cant. We have customers using PB of data for their analytics. Even across 1000 nodes, thats still TB scale. Our approach is not radical, its simple. Build a better architecture system, with much higher bandwidth/lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per machine (our customers do). And if you need to scale up/out, use 100Gb nets, and other things. On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. If you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So lets take your example. 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. And at 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so its 20x 1000, or 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently. That type of allocation would be cost prohibitive. Ok, use 10GbE. You'll actually get about 2-4 GbE speeds, but ok, So maybe 5-10k seconds to move your data. And your run is deeply in the noise, at 4 seconds. Still not good. For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds. Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now. Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics. Again, I am biased, as this is what my company does. |
I find your statements confusing.
The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.
My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"