by pmorici 2899 days ago
That's kind of the point of that post: it's pointing out that you should consider whether you actually need something like Hadoop, and that most people aren't working with data sets large enough for it to make sense.

> Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

"Big Data™" is unnecessary shade.

> One especially under-used approach for data processing is using standard shell tools and commands. The benefits of this approach can be massive, since creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine.

It is entirely unlike having a Storm cluster on your machine, and trying to do your data processing as chained shell commands will rapidly become cumbersome if you try to do actual complex processing.
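
To be concrete about what the quoted claim amounts to: in a pipe, each stage is a separate OS process, so the stages do run concurrently on one machine. The sketch below is only an illustration, not anything from the article. It wires up the equivalent of `grep PATTERN | wc -l` by hand; the function name `count_matching_lines` and the sample input are made up, and it assumes POSIX `grep` and `wc` are on the PATH.

```python
import subprocess


def count_matching_lines(text: str, pattern: str) -> int:
    """Rough equivalent of: printf '%s' "$text" | grep "$pattern" | wc -l

    Hypothetical helper for illustration; assumes POSIX grep and wc exist.
    """
    # Stage 1: grep reads from a pipe we write to and writes to another pipe.
    grep = subprocess.Popen(
        ["grep", pattern],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # Stage 2: wc reads grep's output while grep is still running.
    wc = subprocess.Popen(
        ["wc", "-l"],
        stdin=grep.stdout,
        stdout=subprocess.PIPE,
    )
    # Close our copy of the read end so wc sees EOF once grep exits.
    grep.stdout.close()

    # Feed the input, then close stdin so grep knows the stream has ended.
    grep.stdin.write(text.encode())
    grep.stdin.close()

    out, _ = wc.communicate()
    grep.wait()
    return int(out.strip())


if __name__ == "__main__":
    sample = "game1 win\ngame2 loss\ngame3 win\n"
    print(count_matching_lines(sample, "win"))
```

That is the whole extent of the parallelism: a few processes on one box connected by byte streams. There is no partitioning across machines, no retry, and no fault tolerance, which is why the Storm comparison does not hold.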

Yes, I get that the author is trying to point out that simpler tools can work for many cases, but the tone of the article makes it seem like the author is just generally saying that EMR/Hadoop is bad. He does not acknowledge just how heavily the test he ran is weighted against Hadoop, or give any indication of the tipping point at which you actually want to start considering something distributed. This paints a really misleading picture for anyone who does not already know a fair amount about these technologies.