by gfodor
5170 days ago
Collaborative filtering. I use S3 and EC2 interchangeably when it comes to EMR, which I presume is what the author is referring to. Most EMR jobs read their input from S3, write their output back to S3, and use a temporary HDFS cluster for scratch. By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage. (I am assuming we are talking about non-trivial map reduce jobs with many reduce steps here, not word counters.) It goes without saying that there are many applications where user-created functions will generate more data than they consume (combinatorics, etc.).
Data locality is but one reason to use map reduce. In practice EMR lets you draw upon elastic computing resources to process data however you like. It provides developer and cluster isolation and linearly scalable I/O from S3 as well.
The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice.
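To make the "generate more data than they consume" point concrete, here is a minimal sketch. It assumes an item co-occurrence style of collaborative filtering, which is an assumption for illustration only; the comment above does not say which algorithm is in use. The function name `map_cooccurrences` and the toy data are hypothetical. The map step emits one record per pair of items a user touched, so intermediate output grows roughly quadratically with items per user.

```python
from itertools import combinations

def map_cooccurrences(user_items):
    """Hypothetical map step: emit ((item_a, item_b), 1) for every
    unordered pair of distinct items seen by the same user."""
    for _user, items in user_items.items():
        for a, b in combinations(sorted(set(items)), 2):
            yield (a, b), 1

# Toy input (made up): 3 users, each with the same 20 items.
user_items = {user: list(range(20)) for user in range(3)}

# One input record per (user, item) observation.
input_records = sum(len(items) for items in user_items.values())

# Records that would have to be shuffled and sorted before the reduce step.
intermediate_records = sum(1 for _ in map_cooccurrences(user_items))

print(input_records, intermediate_records)
```

With 20 items per user, each user contributes 190 pairs, so the shuffle sees several times more records than the input held. Whether this blowup reaches terabytes in a real job depends on the data and on pruning or combiner choices, none of which are specified in the thread.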
What collaborative filtering algorithm are you using that requires terabytes of intermediate storage for gigabytes of input data?
I'm familiar with most approaches to CF (SVD, gradient descent, etc) and I can't think of any that require large amounts of intermediate storage.
> By and large scratch data ends up being much, much larger than the original inputs, if for no other reason than that needed during the shuffle/sort stage
I can't think of a single practical situation where you couldn't do your sorting online as you progress through the data. Again, the overhead of moving the data to and from S3 would be greater than that of processing the data locally (unless Amazon's LAN is faster than a SATA bus, which is unlikely).
> The author sounds like someone who may have read the academic papers and a few books but hasn't used these tools in practice.
You keep attacking the author in various ad hominem ways, yet you haven't provided a single uncontrived example of the small-input, large-intermediate-data scenario that your argument relies upon.