HN Mirror



	by makapuf 3458 days ago
	Amen. Also said as: too big for Excel is not big data. See also https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

3 comments

edw 3458 days ago

Amen right back at ya! (I love the O'Reilly book cover.) I highly recommend people read your blog post. And there's also this classic:

https://aadrake.com/command-line-tools-can-be-235x-faster-th...

I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:

https://bost.ocks.org/mike/make/

https://github.com/Factual/drake

If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.


makapuf 3458 days ago

Just for clarification, I'm not the original blogger. +10 for the other link and for using make! I don't know Drake, however.


placeybordeaux 3458 days ago

The picture has now gotten a little fuzzier, as this blog post conflates MapReduce and YARN and calls them both Hadoop. The Scala pseudocode is just about exactly what you'd use with Spark, which runs on YARN.


edw 3458 days ago

I think his point is that bloated, over-engineered Big Data systems—whether batch or streaming—are overkill for the vast majority of problems.


placeybordeaux 3458 days ago

There are just many points that don't really apply to stuff like Spark or Tez, which run on YARN:

ex: Hadoop << SQL, Python Scripts

I completely agree with

MapReduce << SQL, Python Scripts

I do a lot of my processing in Spark SQL and through RDD transformations, as opposed to MapReduce's limiting, slow KV-style processing.
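
To make the "KV-style" complaint concrete, here is a rough sketch in plain Python (no Spark or Hadoop involved). The sample data and the function names `mapreduce_word_count` and `direct_word_count` are made up for illustration. The first function forces a word count through explicit map, shuffle, and reduce phases over key/value pairs; the second expresses the same thing as one chained transformation, which is closer in spirit to how RDD-style code reads.

```python
from collections import Counter, defaultdict

# Hypothetical sample data, purely for illustration.
LINES = ["big data is big", "small data is fine"]


def mapreduce_word_count(lines):
    """KV-style: phrase the job as emit pairs, group by key, reduce."""
    # Map phase: emit a (word, 1) pair for every word.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce phase: collapse each key's values into a single count.
    return {key: sum(values) for key, values in grouped.items()}


def direct_word_count(lines):
    """The same result written as a single chained transformation."""
    return dict(Counter(word for line in lines for word in line.split()))


kv_counts = mapreduce_word_count(LINES)
direct_counts = direct_word_count(LINES)
print(kv_counts == direct_counts)
```

Both functions produce the same counts; the difference is only in how much ceremony the KV-style version needs for a simple aggregation. This says nothing about performance on a real cluster.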


nickpsecurity 3458 days ago

Thanks for the link. The replies in it were hilariously obvious.
