	by zfrenchee 3756 days ago
	Do you know of any good alternatives? Is there any way to write MapReduce jobs in Python?

7 comments

ymt123 3756 days ago

It's not quite the same (since it doesn't become a MapReduce job), but if you're mostly interested in the programming paradigm and scalability, the Python API for Apache Spark might be a good alternative.

tanlermin 3756 days ago

Yes! Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license.

It is also capable of native HDFS integration, YARN, etc., and can express more complex and granular parallel patterns than just MapReduce. It also has an API for distributed dataframes and arrays with linear algebra ops.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

_dark_matter_ 3756 days ago

https://hadoop.apache.org/docs/r1.2.1/streaming.html

pvnick 3756 days ago

This is likely the best answer for those who wish to code within the map/reduce paradigm by hand and would prefer to use python.
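
For anyone wondering what "by hand" looks like here, a minimal word-count sketch follows. As far as I know, Hadoop Streaming runs the mapper and reducer as separate programs that read lines on stdin and write tab-separated key/value lines on stdout, and the framework sorts records by key in between. The function names and the `run_locally` helper are my own, purely illustrative; `run_locally` only imitates the sort step so the logic can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["the quick fox", "The lazy dog the end"])
print(counts)
```

In a real job, each function would presumably live in its own script that loops over `sys.stdin` and prints each record, with the scripts handed to the streaming jar.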

pwang 3754 days ago

BUT WHY

Your performance is going to be complete and utter crap because you're paying for serialization on every single data element.

Dask is higher performance and more pythonic: http://matthewrocklin.com/blog/work/2016/02/22/dask-distribu...
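
A rough way to see the per-element cost is to compare serializing every record on its own against serializing one batch. This is only a sketch under assumptions: Streaming passes text lines over pipes rather than pickles, so `pickle` here is a stand-in for "some fixed overhead per record", and byte counts are just a proxy (the real cost also includes a call and a parse per record). The helper names are made up for illustration.

```python
import pickle


def per_element_bytes(items):
    """Total bytes when every element is serialized as its own message."""
    return sum(len(pickle.dumps(item)) for item in items)


def batched_bytes(items):
    """Total bytes when the whole collection is serialized in one call."""
    return len(pickle.dumps(list(items)))


data = list(range(10_000))
single = per_element_bytes(data)
batch = batched_bytes(data)
print(f"per-element: {single} bytes, batched: {batch} bytes")
```

For a collection of this size, the per-element total comes out larger because each message carries its own header and terminator. For very small collections the batch can actually be bigger, so the gap only matters at volume.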

hamilyon2 3756 days ago

Luigi does a decent job. It is relatively easy to start with and powerful enough to do almost anything.

rch 3755 days ago

I've been using Luigi for a few months, with no complaints. It supports running Python jobs on Hadoop and Spark, but it's not really a MapReduce framework unto itself.

However, http://discoproject.org/ might be worth a look as a standalone alternative.

sitkack 3755 days ago

I have used Disco extensively in the past and have nothing but good things to say about it. Fast job launch, easy to write, and the DFS has been stellar. This was using only Python for job code.

dguaraglia 3756 days ago

Unfortunately, no. We are slowly moving away to a streaming infrastructure, so I've been mostly trying to "keep it running" until we are done replacing it. Sorry.

tanlermin 3756 days ago

Check out dask: http://www.slideshare.net/continuumio

It's free with a permissive license and actively growing.

DISCLAIMER: I don't work for Continuum. I just want to see its projects succeed because, as a user, I will benefit.

gshulegaard 3755 days ago

Andrew Montalenti gave a great talk about scaling out Python at Parsely at the last PyData conference: https://www.youtube.com/watch?v=gVBLF0ohcrE

But TBH, after a certain scale you should really be asking whether or not you should be using Python.

mring33621 3756 days ago

Apache Flink? http://www.kdnuggets.com/2015/11/getting-started-python-apac...
