|
|
|
|
|
by frankmcsherry
2024 days ago
|
|
It just comes down to something as simple as: "if I have shown you 1M different things, and now show you one more thing, what do you have to do to tell me whether that one thing is new or not?" If you can keep a hash map of the things you've seen, then it is easy to respond quickly to that one new thing. If you are not allowed to maintain any state, then you don't have a lot of options to efficiently respond to the new thing, and most likely need to re-read the 1M things. That's the benefit of being stateful. There is a cost too, which is that you need to be able to reconstruct your state in the case of a failure, but fortunately things like differential dataflow (built on TD) are effectively deterministic. Also, I suspect "Spark" is a moving target. The original paper described something that was very much a batch processor; they've been trying to fix that since, and perhaps they've made some progress in the intervening years. |
|
I think Spark does that sometimes these days, but I don't know much about the specifics of how and when Spark does it.
Does TD have to keep _everything_ in memory, or can it be strategic in what it keeps and what it evicts?