| HN Mirror



	by mthomas 4790 days ago
	My understanding of Hadoop and HDFS is that both input and output go to and from disk. Your MR jobs are spun up and pointed at HDFS URLs to read their input from, and they write back to HDFS when done. This means that when there is an error, the intermediate computations don't necessarily have to be redone. However, there is a trade-off: that durability is paid for in disk I/O. Additionally, I believe that HDFS keeps 3 copies of the data on 3 different nodes for redundancy, so there is the overhead of that network traffic as well.
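
To put a rough number on the replication overhead described above, here is a back-of-the-envelope sketch in Python. It is not Hadoop code: the function name and parameters are made up for illustration, and it assumes the writer runs on a datanode so the first replica lands on local disk and only the remaining copies cross the network. Real clusters may behave differently depending on placement policy and configuration.

```python
def replication_overhead(output_bytes, replication=3, first_replica_local=True):
    """Estimate disk and network bytes for writing one job's output to HDFS.

    Hypothetical helper for illustration only. Assumes every replica is a
    full copy and, if first_replica_local is True, that the first copy is
    written on the same node as the writer (no network hop).
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # Every replica is a full copy on some node's disk.
    disk_bytes = output_bytes * replication
    # Copies that have to travel to another node.
    remote_copies = replication - 1 if first_replica_local else replication
    network_bytes = output_bytes * remote_copies
    return disk_bytes, network_bytes


if __name__ == "__main__":
    ten_gib = 10 * 1024**3
    disk, net = replication_overhead(ten_gib)
    print(f"disk written: {disk / 1024**3:.0f} GiB, sent over network: {net / 1024**3:.0f} GiB")
```

Under those assumptions, a job that writes 10 GiB of output with 3 replicas puts 30 GiB on disk and sends 20 GiB over the network, and that cost repeats for every job in a chain that writes to HDFS.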

1 comment

Sounds like, with that configuration applied, the in-memory performance difference between Hadoop and Spark should not be nearly as large.