Hacker News
by dguaraglia 3709 days ago
My 2 cents: I would not recommend basing any new work on MRJob. As someone who inherited and has been maintaining a bunch of code that depends on it, I find that the library seems barely maintained, support for VPC is only partial and not very well documented, the auditing tools stopped working quite a while ago, and tracking the progress/status of EMR jobs is extremely painful (to be fair, this is more of an issue with Elastic MapReduce than MRJob itself).

I love the concept and ease of development, but I can't shake the feeling that the infrastructure is so shaky it almost amounts to instant technical debt (sorry if this offends anyone, I'm just a dumb customer).


It looks like mrjob development has been re-started, but there was a disconcerting period (nearly two years) without a release.[1] I used it for rinky-dink projects, and it seemed fragile at the time, so I can understand your inclination to divest from it.

[1]: https://github.com/Yelp/mrjob/releases

In case anyone's curious, what happened was that Dave (@davidmarin) and I (@irskep), the mrjob maintainers, left Yelp within about a month of each other. (There's no story there, just coincidence.) There was never any momentum with new maintainers, going by the release history.

But now Dave is working on mrjob regularly again, hence the pace of recent improvements.

Grandparent is correct about the second-class support for non-EMR production Hadoop usage. Like any open source project, the code only works well if a major stakeholder invests in improving it. Few non-EMR users spend much time contributing, so the situation doesn't improve.

Hey guys, for what it's worth, MRJob has given us around 3 years of working (if sometimes clunky) EMR, so thanks for that :)
I have the opposite experience with MrJob. Classifying it as an inactive project is demonstrably false. The rest are EMR complaints; I use it on my own Hadoop cluster.
Just read the comment from one of the creators: https://news.ycombinator.com/item?id=11528776
Do you know of any good alternatives? Any way to write MapReduce jobs in Python?
It's not quite the same (since it doesn't become a Map-Reduce job), but if you're mostly interested in the programming paradigm/scalability, the Python API for Apache Spark might be a good alternative.
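For readers less familiar with the programming paradigm being discussed, here is a minimal pure-Python sketch of the three MapReduce phases (map, shuffle, reduce) applied to a word count. It runs in a single process and uses only the standard library; it is not mrjob's, Hadoop's, or Spark's API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key. A cluster framework does this across machines;
    here it is just an in-memory dictionary."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the list of values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases together (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

The libraries mentioned in this thread differ mainly in where these phases run and how data moves between them, not in the shape of the mapper and reducer logic.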
Yes! Check out dask: http://www.slideshare.net/continuumio

It's free, with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just map reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because I, as a user, will benefit.

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use Python.
BUT WHY

Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...
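The per-element serialization cost mentioned above can be illustrated with a rough sketch. It assumes a streaming-style setup in which every record is encoded to text when it leaves one stage and decoded again when it enters the next; JSON is used here purely as a stand-in wire format, and the function names (`streaming_style_sum`, `in_process_sum`) are hypothetical. It does not measure any real framework.

```python
import json
import timeit


def streaming_style_sum(values):
    """Sum values while round-tripping every record through a text encoding,
    as happens when records cross a process boundary one line at a time."""
    total = 0
    codec_calls = 0
    for value in values:
        line = json.dumps(value)    # stage output: record encoded as text
        decoded = json.loads(line)  # next stage input: record decoded again
        codec_calls += 2
        total += decoded
    return total, codec_calls


def in_process_sum(values):
    """Sum the same values without leaving the process: no encoding at all."""
    return sum(values), 0


if __name__ == "__main__":
    data = list(range(100_000))
    # Timings vary by machine, so no expected numbers are given here.
    print("streaming-style:", timeit.timeit(lambda: streaming_style_sum(data), number=5))
    print("in-process:     ", timeit.timeit(lambda: in_process_sum(data), number=5))
```

The number of encode/decode calls grows linearly with the number of records, which is the overhead the comment is pointing at; how much it matters in practice depends on record size and on the work done per record.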

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.
I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However, http://discoproject.org/ might be worth a look as a standalone alternative.

I have used Disco extensively in the past and have nothing but good things to say about it: fast job launch, easy to write, and the DFS has been stellar. This was only using Python for job code.
Unfortunately, no. We are slowly moving away to a streaming infrastructure, so I've been mostly trying to "keep it running" until we are done replacing it. Sorry.
Check out dask: http://www.slideshare.net/continuumio

It's free, with a permissive license, and actively growing.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just map reduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because I, as a user, will benefit.

Andrew Montalenti gave a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.