If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.
The picture has now gotten a little fuzzier, as this blog post conflates MapReduce and YARN and calls them both Hadoop. The Scala pseudocode is almost exactly what you'd use with Spark, which runs on YARN.
https://aadrake.com/command-line-tools-can-be-235x-faster-th...
I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:
https://bost.ocks.org/mike/make/
https://github.com/Factual/drake