Hacker News new | ask | show | jobs
by hobbyist 4545 days ago
I often read that spark avoids the costly synchronization required in mapreduce, since it uses DAG's. Can someone explain how is that achieved. If the application so demands that you can launch jobs together, that can be done even with hadoop/mapreduce. If one job requires the output of another, then the job has to wait for synchronization whether its mapreduce or DAG.
2 comments

Spark's major benefit comes from storing the intermediate results in-memory instead of storing it in HDFS as Hadoop does. Let's say a certain query needs to run 3 mapreduce jobs A, B, C one after another. In Hadoop, there will be 3 hdfs reads and writes. With spark, there will be only 1 hdfs read (before launching A) and 1 write (after C is completed). In spark, the output of A gets stored in RAM which is read by B and so on until the final write.

The DAG used by spark represents how one job/partition of data depends on another job/partition and what methods (e.g. filter) need to be applied on the parent data to get the child data. This is useful when a node goes down and that portion of data has to be recomputed. Note that users can choose to persist some intermediate results to hdfs to avoid recomputation in case of failure.

I believe it is because Spark stores data in objects called RDD (Resilient Distributed Dataset) and these are lazily evaluated. I could be wrong though.

http://spark.incubator.apache.org/docs/latest/quick-start.ht...