by gtrubetskoy 4276 days ago
The strength of Hadoop isn't so much speed as the fact that it's been around for a while and there is a pretty impressive and fairly mature set of projects that make up the Hadoop ecosystem, from YARN to Hive, etc. There are still many issues to resolve, and this evolution will continue for decades to come.

The TB sort benchmark is pretty useless to me - I am much more concerned with stability and a vibrant community (which means people, the software they write, and institutions using Hadoop in production).

Last time I tinkered with Spark (this was over a year ago) it was so buggy as to be next to useless, but perhaps things have changed.

Still - the idea that there is some sort of a revolutionary new approach that is paradigm-shifting and is way better than anything before should be viewed with extreme skepticism.

The problem of distributed computing is not a simple one. I remember tinkering with the Linux kernel back in the mid-nineties, and 20 years later it still has a way to go.

Twenty years from now it might or might not be Hadoop that is the tool for this sort of thing. We don't know, but I will not take seriously anything or anyone claiming that the "next best thing" is here in 2014.


1. Cloudera left M/R for Spark, and Mahout left M/R for Spark. The Spark community will be huge soon.

2. Yes, Spark was/is buggy.

3. For me, Spark really is a paradigm shift, a next-generation framework compared to M/R.

Hadoop != M/R, FWIW. M/R support is left in YARN mostly for backwards compatibility.

If by M/R you mean Hadoop - Cloudera has done no such thing, their largest customer base is Hadoop.

As to "paradigm shift", we're so early in this that I don't think there even is a paradigm to shift.

Sure, "we're 100% behind Impala", "oops, sorry, now it's Spark" - give them a few months and they'll change their mind to something else again. :)
Spark requires Hadoop to run, so this whole Spark vs Hadoop debate makes no sense whatsoever.

There is a place for arguing about how effective Map/Reduce is, but it's been known for years that M/R is neither the only nor the best general-purpose algorithm for solving all problems. More and more tools these days do not use M/R, Spark included, and Spark certainly is not the first tool to provide an alternative to M/R. AFAIK Google abandoned M/R years ago.

I just don't understand this constant boasting about Spark, it seems very suspicious to me.

> Spark _requires_ Hadoop to run

This is not correct. Spark uses the Hadoop Input/Output API, but you don't need any Hadoop component installed to run Spark, not even HDFS.

You can -- and many companies do -- run Spark on Mesos or on Spark's standalone cluster manager, and use S3 as their storage layer.

> this whole Spark vs Hadoop debate makes no sense whatsoever

If we talk about Hadoop as an ecosystem of tools, then yes, it doesn't make sense to frame Spark as a competitor. Spark is part of that ecosystem.

But if we talk about Hadoop as Hadoop 1 MapReduce or as Hadoop 2 Tez, both of which are execution engines, then it very much makes sense to pit Spark against them as an alternative execution engine.

Granted, Hadoop 1 MapReduce is pretty old compared to Spark, and Tez is still under heavy development, but these are alternatives and not complements to Spark.

(Note: In Hadoop 2, MapReduce is just a framework that uses Tez as its underlying execution engine.)

> I just don't understand this constant boasting about Spark, it seems very suspicious to me.

Suspicious how?

I think Spark's elegant API, unified data processing model, and performance -- all of which are documented very well in demos and benchmarks online -- merit the excitement that you see in the "Big Data" community.

Yes, I think the debate makes no sense too. For me, Spark is not a Hadoop competitor; it's rather a complement.

Spark does not need Hadoop - you can also run it with Mesos or in local mode.

Actually Doug Cutting himself (who created Hadoop) tweeted about this. I guess Spark gets some of his blessing :)

As pointed out in the article multiple times, we are comparing with MR here. We are not comparing with Hadoop as an ecosystem. Spark plays nicely with Hadoop. As a matter of fact, this experiment ran on HDFS.

In terms of vibrant community, Spark is now the largest open source Big Data project by community/contributor count. More than 300 people have contributed code to the project.

I remember Nathan Marz saying that Storm is the most active project on GitHub about a year ago. ;)
That may well have been true a year ago, but it is not true as of a few weeks ago [1], and hasn't been true since around spring of this year.

[1] http://youtu.be/zW0Pqfb8ij0?t=3m50s