by james_woods
1977 days ago
Where and how does the dataflow handle late data? How can I configure how refinements of results relate? These are among the standard "What, Where, When, How" questions I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

Also, "Materialize" does not (yet) seem to support needed features like tumbling windows when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally, "Materialize" states in their docs: "State is all in totally volatile memory; if materialized dies, so too does all of the data." That is not the case for Apache Flink, for example, which stores its state in systems like RocksDB.

Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiBs or larger. This is also something that "Materialize" currently lacks:

"Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."
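To make the late-data question concrete, here is a minimal Python sketch of a tumbling window with a watermark and an allowed-lateness bound. It is only an illustration under stated assumptions: the class `TumblingWindowCounter` and all of its methods and parameters are hypothetical names, not part of the Materialize, Flink, or Dataflow APIs, and event times are assumed to be plain integers.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed-size event-time window (hypothetical sketch)."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size            # "where": window length in event time
        self.allowed_lateness = allowed_lateness  # how long a closed window stays updatable
        self.watermark = float("-inf")            # "when": progress of event time
        self.counts = defaultdict(int)            # window start -> running count
        self.dropped = 0                          # events that arrived too late

    def window_start(self, event_time):
        # Tumbling windows do not overlap: each event maps to exactly one window.
        return event_time - (event_time % self.window_size)

    def advance_watermark(self, watermark):
        # Watermarks only move forward.
        self.watermark = max(self.watermark, watermark)

    def on_event(self, event_time):
        start = self.window_start(event_time)
        window_end = start + self.window_size
        # Drop the event once the watermark has passed the window end
        # plus the allowed lateness.
        if window_end + self.allowed_lateness <= self.watermark:
            self.dropped += 1
            return None
        self.counts[start] += 1
        # "how": accumulating refinement, i.e. each emission replaces the
        # previous result for the same window with the updated total.
        return (start, self.counts[start])
```

With `window_size=10` and `allowed_lateness=5`, an event at time 9 that arrives after the watermark reached 12 still refines window 0, while an event at time 4 that arrives after the watermark reached 20 is dropped. A system that discards instead of accumulates would emit only the delta for each late firing.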
|
As for data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).
Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.
[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...
[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...
[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...