Hacker News new | ask | show | jobs
by Joeri 3670 days ago
For a 2 TB dataset you could also pay supermicro 50k to get a 40 core 3 TB RAM monster that can keep that whole dataset in RAM. At 50 GB / sec throughput that would keep your query roundtrip time at somewhere around the minute mark. Not quite 3 seconds, but then not quite a thousand nodes either. Of course, rebooting that machine would be awkward.

Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.

4 comments

I think this is also a vary salient observation. It is always possible to construct really large systems, right up to mainframes. And if you're asked to do so it helps to take as wide a view on the problem as you can.

So from a technical point of view, 50GB / sec on a single machine vs 600GB / sec on a 1000 node cluster. From a cost perspective, running 1 machine is going to be a lot less than running 1,000 machines.

Consider some other aspects as well. If a machine breaks, and you have only one, you are offline, if one breaks and you have 999 left, you're still 99.9% up and running. If you work in 2TB data sets, how many do you have? One? Two? Twenty? The more you have the more storage you end up putting on a machine, and even with a SAN the ability to move terabytes around is a pain. Then there is the enterprise value of the analysis. How much does the analysis add value to the product you sell? In the paper's example of Page Rank one could argue it really made Google's engine better so a lot of value. In an oil and gas context it might be the difference between finding oil or not, so again high value. But in a twitter 'bot' analysis, killing off all the identified bot accounts might have very little relative value to the overall business.

The bottom line is that none of these sorts of choices can be made in isolation. Looking at the choices through a single lens, whether it is performance, cost, or capability, is rarely sufficient to make the best choice. What is more the best choice may seem like a "bad" choice from an engineering perspective but great from a finance perspective. Similarly a good choice from a finance perspective could be a horrible choice from an engineering perspective.

What is important is to keep in mind the strengths of the various choices available to you, and their weaknesses. Then to select from them based on the current and future requirements for the resulting system.

Idle question: is 50 GB/s a reasonable throughput to expect for such a monster? DDR4 has a peak transfer rate of ~12.8-19.2 GB/s per stick (per https://en.wikipedia.org/wiki/DDR4_SDRAM), so I'd expect quite a bit more bandwidth for predictable accesses - are you using some useful rule-of-thumb, or just unduly pessimistic? ;-)
At NetApp when we were doing scaling analysis in the early 2000's it became clear the memory bandwidth was limited more by the transaction rate of the memory controller than it was the actual available bandwidth of the memory subsystem.

That is because a memory transaction involves "opening" a page, and then "doing the operation", which can be one to several hundred locations long. "Pointer chasing", code that reads in a structure, then deferences a pointer to another structure, then derefences that pointer to still another stucture, Etc. was really hard on the memory subsystem. It burned a lot of memory ops reading relatively small chunks of memory.

Its a great topic in systems architecture and there are a number of papers on it.

Could you suggest some good papers/articles on the topic?
I'd point out that modern SSDs are getting within the an order of magnitude of RAM in terms of speed. If your dataset is larger than what can fit in RAM, you can probably come up with some kind of on-disk storage mechanism, at the cost of speed. Buy a few and RAID them and you can probably get even closer.

Still not 1000 machines, but one cost in this scenario is that you also won't have to pay for the humans to run the network the 1000 machines live on, or the power for 999 machines, or the cooling, or the floorspace, etc.

"RAM is for the stack and SSDs are for heap" is a phrase I'm starting to hear.

How about when that operation is at the core of your business logic and happens multiple time a day? How about when it's a blocking operation?

There are plenty of cases where your rule doesn't hold true, and I don't think that they're that uncommon.

Plus having many machines has the advantage of a higher granularity of control over workload and balancing.

Hey, I'm building a hadoop system myself, so I know there are workloads where the rule doesn't apply. Like all rules there are exceptions.