| > Collaborative filtering. What collaborative filtering algorithm are you using that requires terabytes of intermediate storage for gigabytes of input data? I'm familiar with most approaches to CF (SVD, gradient descent, etc.) and I can't think of any that require large amounts of intermediate storage. > By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage I can't think of a single practical situation where you couldn't do your sorting online as you progress through the data. Again, the overhead of moving the data to and from S3 would be greater than the cost of processing the data locally (unless Amazon's LAN is faster than a SATA bus, which is unlikely). > The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice. You keep attacking the author in various ad hominem ways, yet you still haven't provided a single uncontrived example of the small-input, large-intermediate-data scenario that your argument relies upon. |
If you write a trivial map reduce job using Cascading that has 10 reducers, and each reduce step shuffles the data on a different grouping key, you will find that Hadoop alone generates more data than you put in. But again, this isn't the point. The point is that the author is calling anyone who uses AWS for map reduce a "cargo cult", based on an academic argument that the sole purpose of map reduce is to move computation to your data, so if you copy your data you are missing the point. In practice, the cost of uploading your data to S3 is a footnote compared to the computational flexibility and use cases that become possible once you are able to run arbitrary transformations on that data via EMR. You keep ignoring my main point and focusing on my simplistic examples, reading way more into them than was intended.
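For what it's worth, the 10-reducer claim can be checked with a back-of-the-envelope sketch. This is a toy model in plain Python, not Hadoop or Cascading: the function name `shuffle_bytes`, the field names `f0`..`f9`, and the JSON serialization are all assumptions made up for illustration. It also assumes every stage passes each record through unaggregated, and it ignores compression, combiners, and HDFS replication, any of which would change the real numbers.

```python
import json

def shuffle_bytes(records, key_fields):
    """Toy model: bytes written to the shuffle by each group-by stage.

    Hypothetical helper, not a Hadoop or Cascading API. Each stage is
    assumed to re-emit every record as a (key, record) pair.
    """
    per_stage = []
    for key in key_fields:
        # Re-key every record on this stage's grouping field.
        emitted = [(rec[key], rec) for rec in records]
        # The shuffle sorts by key before handing data to the reducer.
        emitted.sort(key=lambda kv: str(kv[0]))
        # Serialized size stands in for bytes spilled to intermediate storage.
        per_stage.append(sum(len(json.dumps(kv)) for kv in emitted))
    return per_stage

# 1000 synthetic records with 10 small integer fields each (made-up data).
records = [{"f%d" % i: (n * (i + 1)) % 7 for i in range(10)} for n in range(1000)]
input_bytes = sum(len(json.dumps(r)) for r in records)

# Ten stages, each grouping on a different field.
stages = shuffle_bytes(records, ["f%d" % i for i in range(10)])
total_shuffle = sum(stages)

print("input bytes:  ", input_bytes)
print("shuffle bytes:", total_shuffle)
print("ratio:        ", total_shuffle / input_bytes)
```

Under those assumptions each stage writes at least as much as the input, because it carries the whole record plus the key, so ten stages write more than ten times the input. Whether that matters in practice is exactly what is being argued about above; the sketch only shows where the multiplier comes from.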