by mattj 4708 days ago
(original author of mrjob here)

Steve's post is 100% correct. I originally wrote mrjob as an internal tool at Yelp out of frustration with using dumbo for multi-step jobs. Specifically, I found myself writing the same incantation of "wrap a mapper / reducer function with an encoding scheme" over and over again. I tried to add protocol support to dumbo (so you could specify that your job reads JSON, uses pickle for intermediate data, and writes Thrift), but I had a hard time working with the dumbo codebase (disclaimer: I haven't looked at it since, so it might be easy to do this now). I also wanted to represent mappers and reducers as Python generators, which makes writing memory-efficient steps natural (e.g. you normally want to rely on the shuffle / sort to do the hard work of aggregating by key, rather than building up state in memory yourself). Finally, I wanted my jobs to be easy to test both from unittest and from the command line. Debugging Hadoop streaming jobs is way more of a pain in the ass than it should be.
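To make the generator and protocol points concrete, here is a rough stdlib-only sketch. It is not mrjob's actual API: `run_local` is a hypothetical in-process stand-in for Hadoop's shuffle / sort, and the word-count job is just an illustration. It uses pickle for intermediate data and JSON for output (Thrift is left out because it needs a third-party library).

```python
import json
import pickle
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    # Yield one (word, 1) pair per word; nothing is accumulated in memory.
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    # `counts` is a lazy iterator over the values for a single key.
    yield word, sum(counts)


def run_local(lines, mapper, reducer):
    """Hypothetical local runner; stands in for Hadoop's shuffle / sort."""
    # Map phase: encode each intermediate pair with pickle.
    intermediate = [
        pickle.dumps(pair)
        for line in lines
        for pair in mapper(None, line)
    ]
    # Shuffle / sort: decode and sort by key so equal keys are adjacent.
    pairs = sorted(
        (pickle.loads(blob) for blob in intermediate),
        key=itemgetter(0),
    )
    # Reduce phase: stream each key's values into the reducer and
    # encode every result as one JSON line.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        for out_key, out_value in reducer(key, values):
            output.append(json.dumps([out_key, out_value]))
    return output


if __name__ == "__main__":
    for line in run_local(["a b a", "B c"], mapper, reducer):
        print(line)
```

Because the mapper and reducer are plain generator functions, they can be called directly from a unittest case with a list of input lines, with no Hadoop cluster involved.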