Hacker News new | ask | show | jobs
by jgraettinger1 815 days ago
The key ingredient is leveled reads across topics.

A task which is transforming across multiple joined topics, or is materializing multiple topics to an external system, needs to read across those topics in a coordinated order so that _doesn't_ happen.

Minimally by using wall-clock publication time of the events so that "A before B" relationships are preserved, and ideally using transaction metadata captured from the source system(s) so that transaction boundaries propagate through the data flow.

Essentially, you have to do a streaming data shuffle where you're holding back some records to let other records catch up. We've implemented this in our own platform [1] (I'm co-founder).

You can also attenuate this further by deliberately holding back data from certain streams. This is handy when stream processing because you sometimes want to react when an event _doesn't_ happen. [2] We have an example that generates events when a bike-share bike hasn't moved from it's last parked station in 48 hours, for example, indicating it's probably broke [3]. It's accomplished by statefully joining a stream of rides, with that same stream but delayed 48 hours.

[1] https://www.estuary.dev [2] https://docs.estuary.dev/concepts/derivations/#read-delay [3] https://github.com/estuary/flow/blob/master/examples/citi-bi...