by hobbyist
4545 days ago
|
|
I often read that Spark avoids the costly synchronization required in MapReduce because it uses DAGs. Can someone explain how that is achieved? If the application allows jobs to be launched together, that can be done with Hadoop/MapReduce as well. If one job requires the output of another, then that job has to wait for it, whether the framework is MapReduce or a DAG.
|
The DAG used by Spark represents how one job/partition of data depends on another job/partition, and which methods (e.g. filter) need to be applied to the parent data to get the child data. This is useful when a node goes down and its portion of the data has to be recomputed: only the lost piece is rebuilt from its parents. Note that users can choose to persist some intermediate results to HDFS to avoid recomputation in case of failure.
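
To make the recomputation idea concrete, here is a rough sketch in plain Python. It is not Spark's API: `LineageNode`, `compute` and `lose` are hypothetical names made up for illustration. Each node only remembers its parent and the method to apply, which is roughly what the DAG records, and a lost result is rebuilt from the nearest parent that still holds its data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LineageNode:
    """Toy stand-in for one partition in a lineage graph (hypothetical, not Spark's API)."""
    name: str
    parent: Optional["LineageNode"] = None
    transform: Optional[Callable[[List[int]], List[int]]] = None
    source: Optional[List[int]] = None   # raw input; set only on the root node
    cached: Optional[List[int]] = None   # in-memory result, gone if the node fails
    computations: int = 0                # how many times this node was (re)computed

    def compute(self) -> List[int]:
        # Reuse the result if it is still held in memory.
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            data = list(self.source)
        else:
            # Walk up the lineage: get the parent data, then apply the method.
            data = self.transform(self.parent.compute())
        self.computations += 1
        self.cached = data
        return data

    def lose(self) -> None:
        """Simulate the machine holding this result going down."""
        self.cached = None


root = LineageNode("input", source=list(range(10)))
evens = LineageNode("evens", parent=root,
                    transform=lambda xs: [x for x in xs if x % 2 == 0])
squares = LineageNode("squares", parent=evens,
                      transform=lambda xs: [x * x for x in xs])

first = squares.compute()    # computes input, evens, squares once each
squares.lose()               # only the last result is lost
second = squares.compute()   # rebuilt from the surviving "evens" result
```

After the failure, only `squares` is computed a second time; `evens` and `input` are not touched because their results survived. Holding on to the `evens` result here plays the same role as persisting an intermediate result in the answer above: it bounds how far back the recomputation has to go.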