| HN Mirror

dguaraglia 3756 days ago

My 2 cents: I would not recommend basing any new work on mrjob. As someone who inherited and has been maintaining a bunch of code that depends on it, I find that the library seems barely maintained: support for VPC is only partial and not very well documented, the auditing tools stopped working quite a while ago, and tracking the progress/status of EMR jobs is extremely painful (to be fair, this is more of an issue with Elastic MapReduce than mrjob itself). I love the concept and ease of development, but I can't shake the feeling that the infrastructure is so shaky it almost amounts to instant technical debt (sorry if this offends anyone, I'm just a dumb customer).

3 comments

__derek__ 3756 days ago

It looks like mrjob development has been re-started, but there was a disconcerting period (nearly two years) without a release.[1] I used it for rinky-dink projects, and it seemed fragile at the time, so I can understand your inclination to divest from it.

[1]: https://github.com/Yelp/mrjob/releases

stevejohnson 3756 days ago

In case anyone's curious, what happened was that Dave (@davidmarin) and I (@irskep), the mrjob maintainers, left Yelp within about a month of each other. (There's no story there, just coincidence.) There was never any momentum with new maintainers, going by the release history.

But now Dave is working on mrjob regularly again, hence the pace of recent improvements.

Grandparent is correct about the second-class support for non-EMR production Hadoop usage. Like any open source project, the code only works well if a major stakeholder invests in improving it. Few non-EMR users spend much time contributing, so the situation doesn't improve.

dguaraglia 3756 days ago

Hey guys, for what it's worth, mrjob has given us around 3 years of working (if sometimes clunky) EMR, so thanks for that :)

gdulli 3755 days ago

I have the opposite experience with mrjob. Classifying it as an inactive project is demonstrably false. The rest are EMR complaints; I use it on my own Hadoop cluster.

dguaraglia 3753 days ago

Just read the comment from one of the creators: https://news.ycombinator.com/item?id=11528776

zfrenchee 3756 days ago

Do you know of any good alternatives? Any way to write MapReduce jobs in Python?

ymt123 3756 days ago

It's not quite the same (since it doesn't become a MapReduce job), but if you're mostly interested in the programming paradigm/scalability, the Python API for Apache Spark might be a good alternative.

tanlermin 3756 days ago

Yes! Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just MapReduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

_dark_matter_ 3756 days ago

https://hadoop.apache.org/docs/r1.2.1/streaming.html

pvnick 3756 days ago

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use Python.
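For anyone who hasn't seen the streaming model: Hadoop Streaming runs your mapper and reducer as plain executables that read lines on stdin and write tab-separated key/value lines on stdout, with the reducer's input arriving sorted by key. Below is a minimal word-count sketch in Python under that contract. The function names and the `run_locally` helper are made up for illustration; the helper only simulates the map, sort, reduce flow in-process and is not part of Hadoop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key. Input must already be sorted by key,
    which is how Hadoop Streaming hands records to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical helper: imitate map -> sort by key -> reduce in one process."""
    mapped = sorted(mapper(lines), key=lambda kv: kv.split("\t", 1)[0])
    return list(reducer(mapped))
```

On a real cluster each function would sit in its own script that loops over `sys.stdin` and prints its output, and the two scripts would be handed to the streaming jar through its `-mapper` and `-reducer` options.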

pwang 3754 days ago

BUT WHY

Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...
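To make the per-element serialization cost concrete, here is a rough stdlib-only sketch. It assumes JSON as a stand-in for whatever text format crosses the process boundary (the real format depends on the setup), and both function names are hypothetical. It only counts encode/decode calls and does not measure time.

```python
import json


def per_record_roundtrip(records):
    """Encode and decode every record separately, the way a
    line-oriented stdin/stdout pipe has to."""
    calls = 0
    out = []
    for rec in records:
        line = json.dumps(rec)        # one encode per element
        out.append(json.loads(line))  # one decode per element
        calls += 2
    return out, calls


def batched_roundtrip(records):
    """Encode and decode the whole chunk in one go."""
    payload = json.dumps(records)     # one encode for the batch
    return json.loads(payload), 2     # one decode for the batch
```

Both paths return the same data; the difference is that the first pays the fixed call overhead once per record, so the number of encode/decode calls grows with the record count instead of the chunk count.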

hamilyon2 3756 days ago

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.

rch 3756 days ago

I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However, http://discoproject.org/ might be worth a look as a standalone alternative.

sitkack 3756 days ago

I have used Disco extensively in the past and have nothing but good things to say about it. Fast job launch, easy to write, and the DFS has been stellar. This was only using Python for job code.

dguaraglia 3756 days ago

Unfortunately, no. We are slowly moving to a streaming infrastructure, so I've mostly been trying to "keep it running" until we are done replacing it. Sorry.

tanlermin 3756 days ago

Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license and actively growing.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just MapReduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

gshulegaard 3756 days ago

Andrew Montalenti gave a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.

mring33621 3756 days ago

Apache Flink? http://www.kdnuggets.com/2015/11/getting-started-python-apac...