by gfodor
5170 days ago
My argument does not rely upon it; it was an example of one of several reasons why running map reduce jobs on the AWS cloud has nothing to do with the amount of input data you are moving around. I am not going to go off into even more detail about specific jobs I run daily that generate a large amount of intermediate data because, unless I paste the source code in this thread and write a paper on it, I assume you won't believe me that there are in fact, in the space of "all map reduce jobs", jobs that can generate more data than they input. If you write a trivial map reduce job using Cascading that has 10 reducers, and each reduce step shuffles the data on a different grouping key, you will find that Hadoop alone is generating more data than you input. But again, this isn't the point. The point is that the author is calling anyone using AWS for map reduce a "cargo cult" based upon an academic argument that the sole purpose of map reduce is to move computation to your data, hence if you copy your data you are missing the point. In practice, the cost of uploading your data to S3 is a footnote compared to the computational flexibility and use cases that become possible once you are able to run arbitrary transformations on that data via EMR. You keep ignoring my main point and focusing on my simplistic examples, reading way more into them than was intended.
It would be an example if you had backed up your assertion that collaborative filtering required large amounts of intermediate data, but apparently you are unwilling or, more likely, unable to do this.
> I am not going to go off into even more detail about specific jobs I run daily that generate a large amount of intermediate data because, unless I paste the source code in this thread and write a paper on it, I assume you won't believe me that there are in fact, in the space of "all map reduce jobs", jobs that can generate more data than they input.
Even more detail? You haven't given me any detail! You've yet to give me a single example of a practical situation where a task involves much larger amounts of intermediate data than its input data. I'm asking you to back up your argument; I'm not asking for access to your source code.
> If you write a trivial map reduce job using Cascading that has 10 reducers, and each reduce step shuffles the data on a different grouping key, you will find that Hadoop alone is generating more data than you input.
If it's so trivial, why can't you give me a single practical use-case?
> In practice, the cost of uploading your data to S3 is a footnote compared to the computational flexibility and use cases that become possible once you are able to run arbitrary transformations on that data via EMR.
Yes, apparently so many use-cases that you can't provide a single example of one!