| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by james_woods 1977 days ago

Where and how in dataflow is late data being handled? How can I configure in which ways refinements relate? These questions are the standard "What Where When How" I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally "Materialize" states in their doc: State is all in totally volatile memory; if materialized dies, so too does all of the data. - this is not true for example for Apache Flink which stores its state in systems like RocksDB.

Having SideInputs or seeds is pretty neat, imagine you have two tables of several TiBs or larger. This is also something that "Materialize" currently lacks: Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data.

1 comments

namibj 1976 days ago

Late data is very deliberately not handled. The reasoning for that is best available at [0]. Now, there are ways [1] to handle bitemporal data, but they have fairly significant issues in ergonomics and performance, due to the additional work needed to allow the bitemporal aggregations.

As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).

Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020... [1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018... [2]: https://github.com/TimelyDataflow/differential-dataflow/issu...

link

james_woods 1976 days ago

Taken from [0] If you wanted to use the information above to make decisions, it could often be wrong. Let's say you want to wait for it to be correct; how long do you wait?

I know how long I want to wait, 30 minutes in one of my cases as I know that I've seen 95% of the important data by then. In the streaming world there is _always_ late data so being able to tell what should happen when the rest (5%) arrives is crucial for me.

This differs from use-case to use-case for me and being able to configure this and handling out-of-order data at scale is key for me when selecting a framework for stream processing. Apache Beam and Apache Flink do this very well.

Taken from [1]: Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note. It obviously only works when you window your data as it needs to fit in memory. The event-time and system-time concept from Beam and Flink are very similar, also the watermark approach. Thank you for sharing the links, For me it is now clearer where the difference lies between differential-dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production and missing out on one of these mentioned points is a deal-breaker for me.

link

jamii 1976 days ago

What do you usually want to happen with late data? In DD you have the option to ignore it at the source but not to update already-emitted results. Is the latter important for you?

link

namibj 1976 days ago

In DDflow, you could also use the `Product` timestamp combinator, and track both the time that event came from, as well as the time you ingested it. You can then make use of the data as soon as the frontier says it's current for the relevant ingestion timestamp, and occasionally advance the frontier for the origin timestamp at the input, so that arrangements can compact historic data. An affected example would be a query that counts "distinct within some time window". It only has to keep that window's `distinct on` values around as long as you can still feed events with timestamps in that window. If you are no longer able to, the values of the `distinct on` become irrelevant for this operator, and only the count for that window needs to be retained.

link

james_woods 1975 days ago

If I have to report transaction (aka money) then yes. I need to update already emitted results. If it's just a log-based metric for internal use then no.

What I would like to have is a choice - and Apache Beam for example lets you choose this.

link