Hacker News
by zfrenchee 3709 days ago
Do you know of any good alternatives? Any way to write MapReduce jobs in Python?
7 comments

It's not quite the same (since it doesn't become a MapReduce job), but if you're mostly interested in the programming paradigm/scalability, the Python API for Apache Spark might be a good alternative.
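
For anyone unsure what "the programming paradigm" covers here, this is a rough pure-Python sketch of the map, shuffle, and reduce steps for a word count. It is not Spark or Hadoop code, it runs in a single process, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

A real framework distributes the map and reduce calls across machines and does the shuffle over the network; the shape of the user code stays roughly the same.
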
Yes! Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just MapReduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use Python.
BUT WHY

Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.
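
A rough way to see the per-element cost, assuming the overhead meant here is each record being serialized on its way in and out of the Python process. The sketch uses the standard library's pickle as a stand-in for whatever serializer a given framework actually uses, and the helper names are made up for illustration.

```python
import pickle
import time


def roundtrip_per_element(records):
    """Serialize and deserialize every record individually."""
    return [pickle.loads(pickle.dumps(record)) for record in records]


def roundtrip_batched(records):
    """Serialize and deserialize the whole list in a single call."""
    return pickle.loads(pickle.dumps(records))


def timed(fn, records):
    """Return (result, elapsed seconds) for one call of fn(records)."""
    start = time.perf_counter()
    result = fn(records)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic (id, word) records; the size is arbitrary.
    records = [(i, "word%d" % i) for i in range(200_000)]
    _, per_element = timed(roundtrip_per_element, records)
    _, batched = timed(roundtrip_batched, records)
    print(f"per element: {per_element:.3f}s, batched: {batched:.3f}s")
```

Both functions return the same data. The per-element version typically takes noticeably longer because the fixed cost of each dumps/loads call is paid once per record instead of once per batch.
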

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.
I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However, http://discoproject.org/ might be worth a look as a standalone alternative.

I have used Disco extensively in the past, nothing but good things to say about it. Fast job launch, easy to write, the DFS has been stellar. This was only using Python for job code.
Unfortunately, no. We are slowly moving to a streaming infrastructure, so I've been mostly trying to "keep it running" until we are done replacing it. Sorry.
Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license and actively growing.

It is also capable of native HDFS integration, YARN, etc., and can do more complex and granular parallel patterns than just MapReduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

Andrew Montalenti gave a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.