Hacker News new | ask | show | jobs
by ChuckMcM 3670 days ago
I like the analysis, basically it says "hey you don't have big data" :-) but that requires a bit more explanation.

The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory. We know that because Amdahl's law tells us that parallelizing something invariably adds overhead. So from a systems perspective that overhead has to be "paid for" by some other improvement, and we'll see that it is access to data.

If your task is to process a 2TB data set, on a single machine using a 6GBS SATA channel and 2TB of FLASH SSDs you can read in that dataset into memory in 3333 seconds (at 600MB/sec which is very optimistic for our system), process it, and lets say you write out a 200GB reduced data set for another 333 seconds. so, conveniently, an hour of I/O time.

Now you take that same dataset and distribute it evenly across a thousand nodes. Each one then has 2GB of the data on it. Each node can read in their portion of the data set in 3 seconds, process it and write out their reduction in .3 seconds.

You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds.

That is when parallel data reduction architectures are better for a problem than a single threaded architecture. And that "betterness" is purely artificial in the sense that you would be better off with a single system that had 1,000 times the I/O bandwidth (cough mainframe cough) than 1,000 systems with the more limited bandwidth. However a 1,000 machines with one SSD it still cheaper buy than one mainframe of similar capability. So if, and its a big if, your algorithm can be expressed as a data map / reduce problem, and your data is large enough to push the cost of getting it into memory to have a look at significantly beyond cost of executing the program, then you can benefit positively by running it on a cluster rather than running it on a local machine.

6 comments

For a 2 TB dataset you could also pay supermicro 50k to get a 40 core 3 TB RAM monster that can keep that whole dataset in RAM. At 50 GB / sec throughput that would keep your query roundtrip time at somewhere around the minute mark. Not quite 3 seconds, but then not quite a thousand nodes either. Of course, rebooting that machine would be awkward.

Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.

I think this is also a vary salient observation. It is always possible to construct really large systems, right up to mainframes. And if you're asked to do so it helps to take as wide a view on the problem as you can.

So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines.

Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business.

The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective.

What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system.

Idle question: is 50 GB/s a reasonable throughput to expect for such a monster? DDR4 has a peak transfer rate of ~12.8-19.2 GB/s per stick (per https://en.wikipedia.org/wiki/DDR4_SDRAM), so I'd expect quite a bit more bandwidth for predictable accesses - are you using some useful rule-of-thumb, or just unduly pessimistic? ;-)
At NetApp when we were doing scaling analysis in the early 2000's it became clear the memory bandwidth was limited more by the transaction rate of the memory controller than it was the actual available bandwidth of the memory subsystem.

That is because a memory transaction involves "opening" a page, and then "doing the operation", which can be one to several hundred locations long. "Pointer chasing", code that reads in a structure, then deferences a pointer to another structure, then derefences that pointer to still another stucture, Etc. was really hard on the memory subsystem. It burned a lot of memory ops reading relatively small chunks of memory.

Its a great topic in systems architecture and there are a number of papers on it.

Could you suggest some good papers/articles on the topic?
I'd point out that modern SSDs are getting within the an order of magnitude of RAM in terms of speed. If your dataset is larger than what can fit in RAM, you can probably come up with some kind of on-disk storage mechanism, at the cost of speed. Buy a few and RAID them and you can probably get even closer.

Still not 1000 machines, but one cost in this scenario is that you also won't have to pay for the humans to run the network the 1000 machines live on, or the power for 999 machines, or the cooling, or the floorspace, etc.

"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.

How about when that operation is at the core of your business logic and happens multiple time a day? How about when it's a blocking operation?

There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.

Plus having many machines has the advantage of a higher granularity of control over workload and balancing.

Hey, I'm building a hadoop system myself, so I know there are workloads where the rule doesn't apply. Like all rules there are exceptions.
Amdahl's law is not about overhead - it is true that parallelizing a computation usually has some cost, and you have to make sure what you're parallelizing is a heavy-enough computation that you still win in the presence of that cost. But that's not what Amdahl's law is about.

It is about the limitation of overall performance improvement when improving the performance of a component. If, say, you improve the performance of a piece of your application that only accounts for 10% of the running time, then you're limited by a 10% total performance improvement. This matters in a parallel context because if you have heavily parallelized your application, and it has actually improved performance, you're likely to eventually be limited by the performance of the sequential parts of your application. See https://en.wikipedia.org/wiki/Amdahl%27s_law ; although the last line of the introduction reads like nonsense.

[old HPC guy here, you kids get off my lawn!]

... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.

This is a non-trivial problem. Ok, it is trivial, but its serial in most cases. Unless you start out with a completely distributed data set. And allocate permanent space on those 1000 nodes. So the data has to move once. And you can amortize that across all the runs.

In reality, you cant. We have customers using PB of data for their analytics. Even across 1000 nodes, thats still TB scale.

Our approach is not radical, its simple. Build a better architecture system, with much higher bandwidth/lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per machine (our customers do). And if you need to scale up/out, use 100Gb nets, and other things.

On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. If you can drive the parallel portion to zero time, you are still stuck with the serial portion literally bounding your performance. So lets take your example.

1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution. And at 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so its 20x 1000, or 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently. That type of allocation would be cost prohibitive.

Ok, use 10GbE. You'll actually get about 2-4 GbE speeds, but ok, So maybe 5-10k seconds to move your data. And your run is deeply in the noise, at 4 seconds.

Still not good.

For less than the cost of doing this with capable/fast machines in the cloud where you have to keep moving your data back and forth, you could get a simple bloody fast machine, that can handle the data read in 50-100 seconds.

Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.

Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.

Again, I am biased, as this is what my company does.

> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.

I find your statements confusing.

The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.

My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"

Heh (on the jobs paused bits) ... most modern shared file systems are HA (or nearly HA) except when people build them cheaply. And then you get an effective RAID0.

I was pointing out that if you are doing the analytics at AWS or similar on-demand scenario (a common pattern I see people trying/using and eventually rejecting), you have a serial data motion step to distribute data to your data lake before processing. Then you extract your results, decommission all those servers. Rinse, and repeat.

The point is, that for ephermal compute/storage scenarios, you have a set of poorly architected resources tied together in a way that pretty much guarantees you have a large (dominant) serial step before anything goes in parallel.

What we advocate (and have been doing so for more than a decade), are far more capable building blocks of storage+compute. So if you are going to build a system to process a large amount of data, instead of buying 1000 nodes and managing them, buy 20-50 far more capable nodes at a small fraction of the price, and get the same/better performance.

It also doesn't take 0 seconds. There is a distribution/management overhead (very much non-zero), as well as data motion overhead for results (depending upon the nature of the query). When you subdivide the problems finely enough, the query management overhead actually dominates the computation (which was another aspect of the point I made, but it wasn't explicit on my part).

So we are fine with building hadoop/spark/kdb+ systems ... we just want them built out of units that can move 10-20GB/s between storage and processing. Which lets you hit 50-100s per TB of processed data. Which gives you a fighting chance at dealing with PB scale problems in reasonable time frames, which a number of our users have.

Sounds like you are doing the opposite of what joyent are doing: they basically pair (relevant parts of) data with the program (to process that part). And the reduce/aggregate over the result (that's my takeaway from joyent's marketing, anyway).

Which company do you work for? (unless it's a secret for some reason)

Actually Joyent's manta is quite similar in concept to what we've been doing for years (before manta came out). The idea is to build very capable systems and aggregate them. Not a bunch of fairly low end units (like typical AWS/etc). Our argument is that if you are going to build a high performance computing infrastructure, you ought to build it in an architecturally useful manner. The cost to do so is marginally more than "cheap n deep", while the savings (fewer systems needed for very large analytics) is substantial.

My company is Scalable Informatics (http://scalableinformatics.com)

Thank you for clarifying. Re-reading your first comment, in light of your second, I see that that's indeed what you were saying in the first place. But apparently that's not quite what I read :-)
> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

Also, aggregate network bandwidth. A major use case for clustered processing is that it is MUCH faster to download external data in parallel across a cluster than in parallel on one box. If timeliness is a major use case of the system you're building, this basically requires a cluster, unless you want to end up re-implementing cluster functionality yourself.

> The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory.

No it isn't. There are plenty of CPU-bound tasks that run much faster if the work is distributed in parallel across multiple machines. We use Hadoop at my company primarily for that reason.

Can you say something more about your workload? Ever since dipping my feet in the MPI pool, I've wondered what kind of problems really lend themselves to running in parallell across multiple machines these days - assuming a single machine has at least 32 hardware threads.

I know there are modelling work loads, like weather forecasting and analysing seismic data - but curious what kind of work you are doing?

This is true but keep in mind that you have to set up a pipeline for collecting the data into HDFS (Storm? batch loading?) & you have to pay for the machines.

So while your analysis is valid, there are more "costs" at play like developer time, cluster maintenance, hardware. I like to play with Spark's ML libraries but am wary about designing projects specifically around them because of this overhead, especially when trying to distribute some API/tech that you'd like others to use.

Not trying to be a downer, I actually wish the choice to go distributed was more of a no-brainer, hah. Would love for some APIs to emerge that could be used locally/distributed transparently without actually having to run a dummy cluster & data migration to run locally.