|
There are a few differences. The main one between Spark and timely dataflow (TD) is that TD operators can be stateful, so they can respond to a new round of input in time proportional to the size of the new input, rather than the size of the new input plus all accumulated state. So, streaming one new record in and seeing how it changes the results of a multi-way join with many other large relations can happen in milliseconds in TD, whereas a batch system will re-read the large inputs as well.

This isn't a fundamentally new difference; Flink had this difference from Spark as far back as 2014. There are other differences between Flink and TD that have to do with state sharing and iteration, but I'd crack open the papers and check out the obligatory "related work" sections each should have.

For example, here's the first para of the Related Work section from the Naiad paper:

> Dataflow Recent systems such as CIEL [30], Spark [42], Spark Streaming [43], and Optimus [19] extend acyclic batch dataflow [15, 18] to allow dynamic modification of the dataflow graph, and thus support iteration and incremental computation without adding cycles to the dataflow. By adopting a batch-computation model, these systems inherit powerful existing techniques including fault tolerance with parallel recovery; in exchange each requires centralized modifications to the dataflow graph, which introduce substantial overhead that Naiad avoids. For example, Spark Streaming can process incremental updates in around one second, while in Section 6 we show that Naiad can iterate and perform incremental updates in tens of milliseconds. |
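
To make the stateful-vs-batch point concrete, here's a rough Python sketch. It is not timely dataflow's API (TD is a Rust library), and `batch_join` and `IncrementalJoin` are made-up names for illustration. It also assumes a two-way join rather than a multi-way one, handles additions only (no retractions), and runs in a single process. The idea is just that an operator which keeps its inputs indexed between rounds can emit the change in output caused by one new record without rescanning everything.

```python
from collections import defaultdict


def batch_join(left, right):
    """Batch-style join: re-index all of `right` and rescan all of `left` every time."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


class IncrementalJoin:
    """Hypothetical stateful join operator: both inputs stay indexed by key between rounds."""

    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def push_left(self, key, value):
        # Store the record, then emit only the new output rows it produces.
        self.left[key].append(value)
        return [(key, value, rv) for rv in self.right.get(key, [])]

    def push_right(self, key, value):
        # Symmetric case for a record arriving on the other input.
        self.right[key].append(value)
        return [(key, lv, value) for lv in self.left.get(key, [])]
```

With a large relation already loaded, `op.push_left(42, "new")` only looks at the records stored under key 42, while calling `batch_join` again would walk both full inputs. Summing up the deltas from every push gives the same rows as the batch join over the same data.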