by threeseed 2900 days ago
Your comment is actually pretty funny, as the entire point of Hadoop was to be able to use commodity, cheap, off the shelf PC hardware as opposed to the exotic specifications you mention there.

Except that of course nowadays such hardware is just a couple of clicks away in AWS.

3 comments

Today's "exotic" (which is actually just high-end commodity) is tomorrow's middling.

I'm not sure it's fair to summarize any one thing as "the entire point" of Hadoop, but, as I recall, it was, originally, an open source implementation of Google's Map-Reduce paper. Put another way, it was a way to bring Google's compute strategy to the masses.

That said, the notion that there is "commodity, cheap, off the shelf PC hardware" and "exotic specifications" is completely a false dichotomy, especially in the face of what, for example, Google actually does.

Google goes cheap. Very cheap. Its hardware is custom and exotic, just optimized for cost: not absolute cost per node, but the ratio of cost to performance.

That last part is what's missing from every single Hadoop installation I've personally seen (or that anyone I know has personally seen): the maximization of performance for cost. Instead, there's an inexplicable desire to increase node count by using cheaper nodes, no matter the performance.
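
To put numbers on the cost-for-performance point above, here is a minimal Python sketch. Everything in it is hypothetical: the `NodeType` class, the prices, and the throughput figures are invented for illustration, and it assumes throughput adds up linearly across nodes, which real clusters rarely achieve.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """A hypothetical node spec; price and throughput are made-up units."""
    name: str
    price: float       # purchase cost per node
    throughput: float  # work units per second per node


def cost_per_unit(node: NodeType) -> float:
    """Cost per unit of throughput: the ratio to minimize, not price alone."""
    return node.price / node.throughput


def cluster_cost(node: NodeType, target_throughput: float) -> tuple[int, float]:
    """Node count and total price needed to reach a target throughput.

    Assumes perfectly linear scaling across nodes (an optimistic assumption).
    """
    count = math.ceil(target_throughput / node.throughput)
    return count, count * node.price


cheap = NodeType("cheap", price=2000.0, throughput=100.0)
fast = NodeType("fast", price=5000.0, throughput=400.0)

for node in (cheap, fast):
    count, total = cluster_cost(node, target_throughput=1200.0)
    print(node.name, cost_per_unit(node), count, total)
```

With these made-up figures, the pricier node costs 12.5 per unit of throughput against 20 for the cheap one, so hitting the same target takes 3 nodes instead of 12 and a smaller total spend.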

> Except that of course nowadays such hardware is just a couple of clicks away in AWS.

I'm a bit unclear what the "except" means here. I don't believe AWS has the truly high-end specs available (and never has, historically, so we can reasonably assume it never will). It's also very not-cheap.

The point of Hadoop might have been that, but it never actually delivered any real value to most users. It's an abysmal failure from a computing efficiency point of view; here's an example: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

I've been using Spark for many years going back to 1.0.

It is the foundation technology of almost every data science team around the world. And your misguided post (which for some weird reason only focuses on graph algorithms) doesn't change that. And not sure why you think it's inefficient. We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

> We run 30 node, autoscaling clusters which stay close to 100% for most of the time.

I have exactly zero knowledge on Spark's efficiency as well as zero on how representative graph algorithms are, but I can confidently say that the above statement fails to refute the referenced article's thesis (which, arguably, criticizes assertions just like that).

The fact that your implementation scales (even autoscales) to use more compute resources says nothing about its efficiency (overall, or even marginal when adding more nodes, i.e. the shape of the curve).

Computer science has struggled with achieving even near-linear scalability ever since the advent of SMP.
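
For anyone who wants to check "the shape of the curve" on their own cluster, here is a rough Python sketch of the two numbers mentioned above: overall parallel efficiency and the marginal speedup per added node. The function names and the timings are hypothetical; you would substitute wall-clock times measured for the same job at different node counts.

```python
def parallel_efficiency(t_single: float, t_cluster: float, nodes: int) -> float:
    """Speedup over a single node divided by node count (1.0 = linear scaling)."""
    if nodes < 1 or t_single <= 0 or t_cluster <= 0:
        raise ValueError("nodes must be >= 1 and times must be positive")
    return (t_single / t_cluster) / nodes


def marginal_speedups(runs: list[tuple[int, float]]) -> list[float]:
    """Speedup gained per added node between consecutive measurements.

    `runs` holds (nodes, seconds) pairs sorted by node count; the first
    entry is the baseline that all speedups are measured against.
    """
    base_seconds = runs[0][1]
    gains = []
    for (n_prev, t_prev), (n_next, t_next) in zip(runs, runs[1:]):
        gain = base_seconds / t_next - base_seconds / t_prev
        gains.append(gain / (n_next - n_prev))
    return gains


# Invented timings: the same job on 1, 10, and 30 nodes.
runs = [(1, 3000.0), (10, 600.0), (30, 500.0)]
print(parallel_efficiency(3000.0, 500.0, 30))
print(marginal_speedups(runs))
```

With those invented timings, a 30-node cluster that is busy the whole time still delivers only a 6x speedup, i.e. 20% efficiency, and the last 20 nodes add about 0.05x each.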

Spark is significantly more efficient than Hadoop.

I don’t know about your specific workload, but I’ve seen quite a few Hadoop setups that were at 100% load most of the time, and were replaced by relatively simple non-Hadoop-based code that used 2% to 10% of the hardware and ran about as fast.

I didn’t spend much time evaluating the “pre”, but at least one workload spent 90% of the 100% on [de]serialization.

It’s not my link, it is Frank McSherry who is commenting in this thread - I hope he can chime in on why he chose this specific example - but it correlates very well with my experience.

No, the point was to be able to process workloads much larger than would fit in memory on a single machine.

Citation needed! :)

Joking snark aside, I'm actually doubtful this is true. Specifically, I don't recall the impetus for Hadoop (or Google's original Map-Reduce, as described in the '04 paper) being an all-in-memory workload.

Despite it being repeatedly brought up in this sub-thread, I maintain that it's a niche use case and that disk-based data processing workloads are far more common.

ETA: Does anyone know of a canonical or early/initial document outlining the purpose, or at least design goals, of Hadoop?