| HN Mirror



	by potatoyogurt 2901 days ago
	I agree with your first two points. It's definitely possible to efficiently process quite a bit of data on a single machine, although I think that past a terabyte, there begin to be strong arguments for a distributed approach even if you can theoretically handle it on one machine (scalability if requirements change, resilience to machine failures, etc.). I still disagree about the article itself (even outside of the conclusion), but perhaps I am reading it uncharitably and other people are not getting the same impression. I do feel that it would be easy for someone who is not very familiar with these technologies to get the wrong impression. Misuse probably does go mainly in the other direction (people overusing Hadoop rather than underusing it), though, so maybe that is not so important a concern.


mmt 2901 days ago

> I think that past a terabyte, there begin to be strong arguments for a distributed approach even if you can theoretically handle it on one machine

I've elaborated on these in another comment, as well.

Today, drawing the line at a single terabyte is way too early, even for all-in-memory workloads, if only because AWS now offers an instance with almost 4TB of RAM. Drawing it anywhere below 3.5TB (or whatever RAM is available to applications) is, at best, living in the past.

> scalability if requirements change

This reads as premature optimization, which turns the strong argument into either a weak argument or even an argument against.

Now, if you know or have reasonable certainty that your requirements will change (and will do so faster than, say, Moore's Law) and change soon, then that's different. I suspect there are people who think this, but that it's little more than wishful thinking or a delusion as to how large their slice of "web scale" actually is.

> resilience to machine failures

Machine failures just aren't a legitimate consideration for modern, high-end (but still commodity) hardware. You wouldn't bet your whole business on it, of course, but a 1% chance every year of losing an hour or two of batch processing? Sure.
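A minimal sketch of the arithmetic behind that estimate, using the figures above (a 1% chance per year, one to two hours lost). The function name is hypothetical, and the model assumes at most one such failure per year and ignores any cost beyond the lost batch time.

```python
def expected_hours_lost_per_year(annual_failure_probability: float,
                                 hours_lost_per_failure: float) -> float:
    """Expected batch-processing hours lost per year to a machine failure.

    Assumption: at most one failure per year, so the expectation is simply
    probability times cost.
    """
    return annual_failure_probability * hours_lost_per_failure


# Figures from the comment: a 1% yearly chance of losing one to two hours.
low = expected_hours_lost_per_year(0.01, 1.0)
high = expected_hours_lost_per_year(0.01, 2.0)
print(f"Expected loss: {low * 60:.1f} to {high * 60:.1f} minutes per year")
```

Under those assumptions the expected loss is on the order of a minute of batch processing per year, which is the scale being weighed against the cost of a distributed setup.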

Sadly, the flip side of this is that I see Hadoop clusters being built with such reliable servers, including redundant PSUs and fans, instead of taking full advantage of the resilience at the software level in order to save as much as possible at the hardware level. The original company behind map-reduce is certainly not splurging on hardware.


potatoyogurt 2900 days ago

I'm not saying that past a terabyte is a point where you definitely want to use distributed processing, just that at that point, you should really strongly consider it. There is usually a lot of fuzziness around estimates you get about what sort of data volume you'll need to deal with, and it's not uncommon for it to vary by integral factors between days. If you're pushing the limits of what your system can handle without needing a dramatic rearchitecting, then that's a big risk, and it's not necessarily premature to build in the flexibility to have the option of scaling in the future if you need to. If you hit that 4TB and you still need more, it will be a big headache.
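A minimal sketch of that headroom check, assuming a hypothetical 1TB baseline and the roughly 3.5TB of usable RAM mentioned upthread. The function name and the list of factors are made up for illustration.

```python
def fits_on_one_machine(baseline_tb: float, peak_factor: float,
                        usable_ram_tb: float) -> bool:
    """Return True if the peak data volume still fits in usable memory."""
    return baseline_tb * peak_factor <= usable_ram_tb


BASELINE_TB = 1.0    # hypothetical typical daily volume
USABLE_RAM_TB = 3.5  # assumption: usable RAM on the ~4TB instance mentioned upthread

# Data volume varying "by integral factors" between days.
for factor in (1, 2, 3, 4):
    verdict = "fits" if fits_on_one_machine(BASELINE_TB, factor, USABLE_RAM_TB) else "does not fit"
    print(f"{factor}x baseline ({BASELINE_TB * factor:.1f}TB): {verdict}")
```

With these numbers a 4x day is already past what the single machine can hold in memory, which is the situation I'd want to have planned for.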

I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.


mmt 2900 days ago

Fair enough for in-memory only: if 1TB is your raw data, it's going to be bigger by the time it's indexed.

Surely, though, workloads that require in-memory performance are fairly niche, and jumping from there to distributed (even in-memory) seems non-obvious, at best. Why aren't large arrays of fast SSDs a better alternative? The bandwidth is comparable, but the latency is terrible (still comparable to ethernet to a remote node, though?).

What about workloads that don't require fully-in-memory in the first place? If the cutoff is then hundreds of TB, wouldn't that cover the vast majority of common use cases?

> I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.

That sort of anecdata isn't very useful, because a human can cause any failure at any layer, including someone stopping a whole cluster, which I've seen happen before.

My point about it not being a legitimate concern is that what is now common practice, with what is now common equipment, makes machine failure uncommon. These practices and equipment had to evolve, but that evolution happened over a decade ago.

Also, be wary of selection bias. It's very easy to remember the "fire drill" because of the one machine failure, and it makes a much more interesting story to tell, one that gets passed around and modified enough that it eventually sounds like multiple stories and therefore multiple machines. The hundreds of servers that operated unheard and unseen for years, sometimes beyond their specs (e.g. with only one blower out of four still turning and only half-speed at that), get nary a thought, let alone mention.
