by mthomas 4743 days ago
My understanding of Hadoop and HDFS is that both input and output go to/from disk. Your MR jobs are spun up and pointed at HDFS URLs to read input from, and they write back to HDFS when done. This means that when there is an error, the intermediate computations don't necessarily have to be redone. However, there is a trade-off: reading and writing through disk at every step costs time.

Additionally, I believe that HDFS keeps 3 copies of the data around on 3 different nodes for redundancy. So there is the overhead of that network traffic.
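
A rough back-of-the-envelope sketch of that overhead, assuming a replication factor of 3 and that the writer runs on a datanode so the first copy is local and only the other copies cross the network. The function name and this simplified model are mine, not anything from Hadoop itself:

```python
def replication_network_bytes(data_bytes: int,
                              replication: int = 3,
                              first_replica_local: bool = True) -> int:
    """Estimate bytes sent over the network to store `data_bytes` in HDFS.

    Simplified model (an assumption): each replica that is not on the
    writer's own node costs one full copy of the data over the network.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # If the first replica lands on the local node, it costs no network traffic.
    remote_copies = replication - 1 if first_replica_local else replication
    return data_bytes * remote_copies


one_gib = 1024 ** 3
# Ratio of network traffic to data written, with the first replica local.
print(replication_network_bytes(one_gib) / one_gib)
```

So under these assumptions, every byte of job output turns into roughly two bytes of network traffic, on top of the disk writes.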

1 comment

I did find this (now-fixed) bug/enhancement: https://issues.apache.org/jira/browse/HDFS-2246

Sounds like, with that configuration applied, the in-memory performance difference between Hadoop and Spark should not be nearly as large.