Hacker News new | ask | show | jobs
The map-reduce cargo cult (blog.locut.us)
9 points by sllrpr 5171 days ago
1 comments

Uh, no. The reason you copy your data to EC2 is so you can perform arbitrary transformations on it in a scalable way. I can dump 10GB of data to EC2 and then use 1000 machines to perform some processing, during which there are many terabytes of intermediate data, and an output which is less than a few kilobytes. (For example, computing the 10 most popular search queries and their click through rates over the last year.)

A tip for the author: if you notice a number of your peers doing something, perhaps you should give them the benefit of the doubt and inquire further.

If the best way you can think of to determine the 10 most popular search queries and their click-through rates involves terabytes of intermediate data for only 10GB of input data, then you're doing something very wrong.
The point is that small inputs can have large intermediate scratch datasets and even smaller outputs.

Edit: Even more to the point is that being able to do scalable transformations like computing top N CTR on a lot of data with little regard to available computing/network/disk resources is the reason why you would copy your input data to EC2 for processing. If the author has a point to make he failed to do so beyond making himself look like someone who enjoys labeling things he doesn't understand as a "cargo cult."

> The point is that small inputs can have large intermediate scratch datasets and even smaller outputs.

Perhaps in some rare circumstances (although not the one you cited), however most people use map reduce for aggregation of one form or another, which doesn't require vast amounts of intermediate data unless you are being deliberately inefficient.

> Even more to the point is that being able to do scalable transformations like computing top N CTR on a lot of data with little regard to available computing/network/disk resources is the reason why you would copy your input data to EC2 for processing.

Actually you'd copy it to S3 for processing, and then it would need to be downloaded into EC2 (unless you want to leave your EC2 instances running, which you won't unless you have a large number of shares in Amazon). It's hard to imagine situations where it is faster to move the data across Amazon's LAN, than to simply process it on the machine it's already on.

> If the author has a point to make he failed to do so beyond making himself look like someone who enjoys labeling things he doesn't understand as a "cargo cult."

The author looks like someone pointing out that the original purpose of map-reduce is that you do your computations where your data is, and that moving your data so that you can do map-reduce on it misses the point. The author is correct.

You might have a stronger argument if you could show some common non-contrived situations where there would be a relatively small amount of input data but vast amounts of intermediate data. You haven't yet.

Collaborative filtering.

I use S3 and EC2 interchangably when it comes to EMR, which is what I presume is what the author is referring to. Most EMR jobs consume and write their data to S3 and use a temporary HDFS cluster for scratch. By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage. (I am assuming we are talking about non trivial map reduce jobs here, not word counters, where you have many reduce steps.) it goes without saying there are many applications where user-created functions will generate more data than they consume (combinatorics, etc)

Data locality is but one reason to use map reduce. In practice EMR allows you to draw upon elastic computing resources to allow you to process data however you like. It provides developer and cluster isolation and linearly scalable I/O from S3 as well. The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice.

> Collaborative filtering.

What collaborative filtering algorithm are you using that requires terabytes of intermediate storage for gigabytes of input data?

I'm familiar with most approaches to CF (SVD, gradient descent, etc) and I can't think of any that require large amounts of intermediate storage.

> By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage

I can't think of a single practical situation where you couldn't do your sorting online as you progress through the data. Again, the overhead of moving the data to-and-from S3 would be greater than processing the data locally (unless Amazon's LAN is faster than a SATA bus, which is unlikely).

> The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice.

You keep attacking the author in various ad hominem ways, yet you haven't yet provided a single uncontrived example of the small input data, large intermediate data scenario that your argument relies upon.