by gfodor
5170 days ago
The point is that small inputs can have large intermediate scratch datasets and even smaller outputs. Edit: Even more to the point is that being able to do scalable transformations like computing top N CTR on a lot of data with little regard to available computing/network/disk resources is the reason why you would copy your input data to EC2 for processing. If the author has a point to make he failed to do so beyond making himself look like someone who enjoys labeling things he doesn't understand as a "cargo cult."
Perhaps in some rare circumstances (although not the one you cited), but most people use map reduce for aggregation of one form or another, which doesn't require vast amounts of intermediate data unless you are being deliberately inefficient.
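To make the aggregation point concrete, here is a rough Python sketch of a top N CTR job with a map-side combine step. It is only an illustration under assumptions that are not from the thread: events are taken to be `(ad_id, clicked)` pairs, partitions are plain in-memory lists, and every function name is hypothetical. The idea is that each partition is collapsed to per-ad counts before anything is shuffled, so the intermediate data scales with the number of distinct ads rather than with the number of events.

```python
from collections import Counter
import heapq


def map_partition(events):
    """Map + combine: collapse one partition's events into per-ad counts.

    `events` is assumed to be an iterable of (ad_id, clicked) pairs.
    """
    impressions, clicks = Counter(), Counter()
    for ad_id, clicked in events:
        impressions[ad_id] += 1
        if clicked:
            clicks[ad_id] += 1
    return impressions, clicks


def reduce_partials(partials):
    """Reduce: sum the per-partition counts into global counts."""
    total_impressions, total_clicks = Counter(), Counter()
    for impressions, clicks in partials:
        total_impressions.update(impressions)
        total_clicks.update(clicks)
    return total_impressions, total_clicks


def top_n_ctr(partitions, n):
    """Return the n ads with the highest click-through rate."""
    impressions, clicks = reduce_partials(map_partition(p) for p in partitions)
    ctr = {ad: clicks[ad] / impressions[ad] for ad in impressions}
    return heapq.nlargest(n, ctr.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    partitions = [
        [("a", True), ("a", False), ("b", False)],
        [("b", False), ("c", True), ("a", True)],
    ]
    print(top_n_ctr(partitions, 2))
```

In this sketch each partition emits at most one pair of counters per distinct ad, which is the sense in which the intermediate data stays small. A job that shuffled every raw event to the reducers would produce the same answer with far more intermediate data.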
> Even more to the point is that being able to do scalable transformations like computing top N CTR on a lot of data with little regard to available computing/network/disk resources is the reason why you would copy your input data to EC2 for processing.
Actually you'd copy it to S3 for processing, and then it would need to be downloaded into EC2 (unless you want to leave your EC2 instances running, which you won't unless you have a large number of shares in Amazon). It's hard to imagine situations where it is faster to move the data across Amazon's LAN than to simply process it on the machine it's already on.
> If the author has a point to make he failed to do so beyond making himself look like someone who enjoys labeling things he doesn't understand as a "cargo cult."
The author looks like someone pointing out that the original purpose of map-reduce is that you do your computations where your data is, and that moving your data so that you can do map-reduce on it misses the point. The author is correct.
You might have a stronger argument if you could show some common non-contrived situations where there would be a relatively small amount of input data but vast amounts of intermediate data. You haven't yet.