by sijoe 3665 days ago
[old HPC guy here, you kids get off my lawn!]

... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.

This is a non-trivial problem. Ok, it is trivial, but it's serial in most cases, unless you start out with a completely distributed data set and allocate permanent space on those 1000 nodes. Then the data only has to move once, and you can amortize that across all the runs.

In reality, you can't. We have customers using PB of data for their analytics. Even across 1000 nodes, that's still TB scale per node.

Our approach is not radical, it's simple. Build a better-architected system, with much higher bandwidth/lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per TB on each machine (our customers do). And if you need to scale up/out, use 100Gb nets, and other things.

On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. Even if you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So let's take your example.

1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. And at 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so it's 20 s x 1000 = 20,000 s, or about 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently. That type of allocation would be cost prohibitive.

Ok, use 10GbE. You'll actually get about 2-4 Gb/s, but ok, so maybe 5-10k seconds to move your data. And your run, at 4 seconds, is deeply in the noise.
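For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The effective link rates (about 0.8 Gb/s on 1GbE, 2-4 Gb/s on 10GbE) and the decimal units are assumptions for illustration, not measurements, and `transfer_seconds` is just a helper name made up for this example.

```python
# Back-of-envelope check of the serial data distribution times.
# Assumptions: decimal units (1 TB = 1e12 bytes) and effective
# throughput of ~0.8 Gb/s on 1GbE and 2-4 Gb/s on 10GbE.

def transfer_seconds(size_bytes: float, gbits_per_s: float) -> float:
    """Seconds to push size_bytes through a link at the given effective rate."""
    return size_bytes * 8 / (gbits_per_s * 1e9)

TB = 1e12
NODES = 1000
dataset = 2 * TB
chunk = dataset / NODES  # 2 GB destined for each node

# 1GbE: one 2 GB chunk at a time, out of a single source.
per_chunk_1g = transfer_seconds(chunk, 0.8)
serial_1g = per_chunk_1g * NODES

# 10GbE at realistic effective rates: whole data set leaves one source.
serial_10g_slow = transfer_seconds(dataset, 2.0)
serial_10g_fast = transfer_seconds(dataset, 4.0)

print(f"1GbE:  {per_chunk_1g:.0f} s per chunk, {serial_1g:.0f} s total "
      f"({serial_1g / 86400:.2f} day)")
print(f"10GbE: {serial_10g_fast:.0f}-{serial_10g_slow:.0f} s total")
```

Under these assumptions the 10GbE case comes out at a few thousand seconds, the same ballpark as the estimate above, and still three orders of magnitude larger than the 4 second run.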

Still not good.
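To put Amdahl's law in numbers for this example, here is a minimal sketch. It assumes the 4 second run is the parallel part on 1000 nodes and that it scales perfectly (so about 4000 seconds of work on one node), and it uses the 20,000 second 1GbE distribution time as the serial part. The function name is made up for illustration.

```python
def amdahl_speedup(serial_s: float, parallel_s: float, workers: int) -> float:
    """Speedup over one worker when only parallel_s is divided across workers."""
    single_worker = serial_s + parallel_s
    many_workers = serial_s + parallel_s / workers
    return single_worker / many_workers

serial_s = 20_000.0        # serial data distribution over 1GbE
run_on_cluster_s = 4.0     # the parallel run on the full cluster
workers = 1000
# Assumption: perfect scaling, so one node would need workers * 4 s.
parallel_s = run_on_cluster_s * workers

speedup = amdahl_speedup(serial_s, parallel_s, workers)
# Upper bound as the worker count goes to infinity: the serial part remains.
limit = (serial_s + parallel_s) / serial_s

print(f"speedup on {workers} nodes: {speedup:.2f}x, best possible: {limit:.2f}x")
```

With those inputs, 1000 nodes buy you barely more than a single node would, because almost all of the wall clock time is the serial data motion.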

For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds.

Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.

Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.

Again, I am biased, as this is what my company does.

2 comments

> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.

I find your statements confusing.

The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.

My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"

Heh (on the jobs paused bits) ... most modern shared file systems are HA (or nearly HA) except when people build them cheaply. And then you get an effective RAID0.

I was pointing out that if you are doing the analytics at AWS or in a similar on-demand scenario (a common pattern I see people trying/using and eventually rejecting), you have a serial data motion step to distribute data to your data lake before processing. Then you extract your results and decommission all those servers. Rinse, and repeat.

The point is that for ephemeral compute/storage scenarios, you have a set of poorly architected resources tied together in a way that pretty much guarantees you have a large (dominant) serial step before anything goes in parallel.

What we advocate (and have been doing for more than a decade) is far more capable building blocks of storage+compute. So if you are going to build a system to process a large amount of data, instead of buying 1000 nodes and managing them, buy 20-50 far more capable nodes at a small fraction of the price, and get the same/better performance.

It also doesn't take 0 seconds. There is a distribution/management overhead (very much non-zero), as well as data motion overhead for results (depending upon the nature of the query). When you subdivide the problems finely enough, the query management overhead actually dominates the computation (which was another aspect of the point I made, but it wasn't explicit on my part).
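A toy model of that last point, purely as a sketch: assume the work splits perfectly across tasks, but a coordinator pays a fixed cost per task, one task at a time. Both constants below (4000 s of total work, 10 ms of management per task) are hypothetical, picked only to show the shape of the curve.

```python
def job_seconds(work_s: float, tasks: int, overhead_per_task_s: float) -> float:
    """Toy model: perfectly divisible work plus serial per-task management cost."""
    return work_s / tasks + overhead_per_task_s * tasks

work_s = 4000.0     # hypothetical total compute, in single-node seconds
overhead_s = 0.01   # hypothetical scheduling/management cost per task

for tasks in (10, 100, 1000, 10_000, 100_000):
    compute = work_s / tasks
    manage = overhead_s * tasks
    total = job_seconds(work_s, tasks, overhead_s)
    print(f"{tasks:>7} tasks: compute {compute:8.2f} s, "
          f"manage {manage:8.2f} s, total {total:8.2f} s")
```

In this model the total stops improving once the management term overtakes the compute term, and after that finer subdivision makes the job slower, not faster.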

So we are fine with building hadoop/spark/kdb+ systems ... we just want them built out of units that can move 10-20GB/s between storage and processing. Which lets you hit 50-100s per TB of processed data. Which gives you a fighting chance at dealing with PB scale problems in reasonable time frames, which a number of our users have.
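The 50-100 s per TB figure is just size divided by bandwidth. A quick sketch, assuming decimal units and, for the PB case, a hypothetical 50-node cluster with the data spread evenly and every node scanning its share at the same time:

```python
def scan_seconds(size_bytes: float, gbytes_per_s: float) -> float:
    """Seconds to stream size_bytes from storage at the given rate in GB/s."""
    return size_bytes / (gbytes_per_s * 1e9)

TB = 1e12
PB = 1e15

per_tb_slow = scan_seconds(TB, 10.0)   # at 10 GB/s
per_tb_fast = scan_seconds(TB, 20.0)   # at 20 GB/s

# Assumption: 1 PB split evenly over 50 nodes, all scanning concurrently.
nodes = 50
per_node = PB / nodes
pb_slow = scan_seconds(per_node, 10.0)
pb_fast = scan_seconds(per_node, 20.0)

print(f"1 TB on one node: {per_tb_fast:.0f}-{per_tb_slow:.0f} s")
print(f"1 PB on {nodes} nodes: {pb_fast:.0f}-{pb_slow:.0f} s per pass")
```

Under those assumptions a full pass over a PB is on the order of tens of minutes, ignoring any query management or result movement overhead.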

Sounds like you are doing the opposite of what Joyent is doing: they basically pair (relevant parts of) data with the program (to process that part), and then reduce/aggregate over the result (that's my takeaway from Joyent's marketing, anyway).

Which company do you work for? (unless it's a secret for some reason)

Actually Joyent's Manta is quite similar in concept to what we've been doing for years (before Manta came out). The idea is to build very capable systems and aggregate them, not a bunch of fairly low end units (like typical AWS/etc). Our argument is that if you are going to build a high performance computing infrastructure, you ought to build it in an architecturally useful manner. The cost to do so is marginally more than "cheap n deep", while the savings (fewer systems needed for very large analytics) are substantial.

My company is Scalable Informatics (http://scalableinformatics.com)

Thank you for clarifying. Re-reading your first comment, in light of your second, I see that that's indeed what you were saying in the first place. But apparently that's not quite what I read :-)