
Spark's major benefit comes from keeping intermediate results in memory instead of writing them to HDFS as Hadoop MapReduce does. Let's say a certain query needs to run 3 MapReduce jobs A, B, C one after another. In Hadoop, every job reads its input from HDFS and writes its output back, so there will be 3 HDFS reads and 3 writes. With Spark, there will be only 1 HDFS read (before launching A) and 1 write (after C is completed). The output of A stays in RAM (memory permitting), where B reads it, and so on until the final write.
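To make the read/write counting concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code: `FakeHDFS`, the three job functions and both runner functions are made-up names for illustration, and the "filesystem" is just a dict that counts accesses.

```python
class FakeHDFS:
    """Hypothetical stand-in for HDFS that only counts reads and writes."""

    def __init__(self, data):
        self.store = {"input": data}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.store[key]

    def write(self, key, value):
        self.writes += 1
        self.store[key] = value


# Three toy "jobs" standing in for A, B and C.
def job_a(rows):
    return [x * 2 for x in rows]


def job_b(rows):
    return [x + 1 for x in rows]


def job_c(rows):
    return [x for x in rows if x % 3 == 0]


JOBS = [job_a, job_b, job_c]


def run_mapreduce_style(fs):
    """Each job reads its input from storage and writes its output back."""
    key = "input"
    for i, job in enumerate(JOBS):
        rows = fs.read(key)
        key = f"out{i}"
        fs.write(key, job(rows))
    return fs.store[key]


def run_in_memory_style(fs):
    """Read once, pass intermediate results along in RAM, write once."""
    rows = fs.read("input")
    for job in JOBS:
        rows = job(rows)
    fs.write("final", rows)
    return rows


mr_fs = FakeHDFS(list(range(10)))
run_mapreduce_style(mr_fs)
print(mr_fs.reads, mr_fs.writes)    # 3 3

mem_fs = FakeHDFS(list(range(10)))
run_in_memory_style(mem_fs)
print(mem_fs.reads, mem_fs.writes)  # 1 1
```

Both runners produce the same result; only the number of trips to storage differs, which is the whole point of the comparison.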

The DAG used by Spark represents how one job/partition of data depends on another job/partition and which operations (e.g. filter) need to be applied to the parent data to get the child data. This is useful when a node goes down: only the lost portion of the data has to be recomputed, by replaying those operations on its parents. Note that users can choose to persist some intermediate results to HDFS to avoid recomputation in case of failure.
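The lineage idea can be sketched in a few lines of plain Python. This is a single-process toy, not Spark's implementation: `ToyRDD` and all of its methods are hypothetical names, and it only models the simple case where child partition i depends on parent partition i alone (no shuffles).

```python
class ToyRDD:
    """Hypothetical single-process model of a dataset that remembers its lineage."""

    def __init__(self, num_partitions, parent=None, fn=None, source=None):
        self.num_partitions = num_partitions
        self.parent = parent      # the dataset this one was derived from
        self.fn = fn              # per-partition operation applied to the parent
        self.source = source      # root only: partition index -> list of rows
        self.cache = {}           # partition index -> rows held "in RAM"
        self.computations = 0     # how many partitions were (re)computed

    def map(self, f):
        return ToyRDD(self.num_partitions, parent=self,
                      fn=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(self.num_partitions, parent=self,
                      fn=lambda rows: [x for x in rows if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if it is missing."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            rows = self.source(i)
        else:
            rows = self.fn(self.parent.compute(i))
        self.computations += 1
        self.cache[i] = rows
        return rows

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, i):
        """Simulate a node failure that wipes partition i from memory."""
        self.cache.pop(i, None)


base = ToyRDD(2, source=lambda i: list(range(i * 5, i * 5 + 5)))
squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(squares.collect())      # [0, 4, 16, 36, 64]
squares.lose_partition(1)     # "node" holding partition 1 goes down
print(squares.collect())      # same result; only partition 1 is rebuilt
print(squares.computations)   # 3: two initial partitions plus one recomputation
```

Persisting an intermediate result to HDFS corresponds, in this toy, to a cache that survives `lose_partition`: recovery would then stop at that dataset instead of walking further back up the lineage.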