Hacker News
by makapuf 3411 days ago
Amen. Also said as: too big for Excel is not big data. See also https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
3 comments

Amen right back at ya! (I love the O'Reilly book cover.) I highly recommend people read your blog post. And there's also this classic:

https://aadrake.com/command-line-tools-can-be-235x-faster-th...

I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:

https://bost.ocks.org/mike/make/

https://github.com/Factual/drake

If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.

Just for clarification, I'm not the original blogger. +10 for the other link and for using Make! I don't know Drake, however.
The picture has gotten a little fuzzier now, as this blog post conflates MapReduce and YARN and calls them both Hadoop. The Scala pseudocode is just about exactly what you'd use with Spark, which runs on YARN.
I think his point is that bloated, over-engineered Big Data systems—whether batch or streaming—are overkill for the vast majority of problems.
There are just many points that don't really apply to stuff like Spark or Tez, which run on YARN:

ex: Hadoop << SQL, Python Scripts

I completely agree with

MapReduce << SQL, Python Scripts

I do a lot of my processing with Spark SQL and through RDD transformations, as opposed to MapReduce's limiting, slow KV-style processing.
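
To make the KV-versus-SQL contrast concrete, here is a minimal sketch in plain Python (no Hadoop or Spark involved). The table name, column names, and sample rows are all made up for illustration. It only shows how the same per-key sum looks as explicit map/shuffle/reduce steps versus a single GROUP BY query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, nbytes) rows, invented for illustration.
rows = [("alice", 120), ("bob", 300), ("alice", 80), ("carol", 50), ("bob", 25)]


def mapreduce_totals(records):
    """Key-value style: map to (key, value) pairs, shuffle by key, then reduce."""
    # Map phase: emit one (key, value) pair per record.
    mapped = [(name, nbytes) for name, nbytes in records]
    # Shuffle phase: group all values under their key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce phase: collapse each key's values into a single total.
    return {key: sum(values) for key, values in grouped.items()}


def sql_totals(records):
    """Declarative style: the same aggregation as one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE traffic (name TEXT, nbytes INTEGER)")
        conn.executemany("INSERT INTO traffic VALUES (?, ?)", records)
        cursor = conn.execute(
            "SELECT name, SUM(nbytes) FROM traffic GROUP BY name"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(mapreduce_totals(rows))
    print(sql_totals(rows))
```

Both functions return the same totals. The difference is how much plumbing you write by hand, which gets worse once joins or multi-stage aggregations are involved.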

Thanks for the link. The replies in it were hilariously obvious.