Hacker News
by gen220 2028 days ago
That's a great description! Does Materialize describe how they implement timely dataflow?

At my current company, we have built some systems like this, where a downstream table is essentially a function of a dozen upstream tables.

Whenever one of the upstream tables changes, its primary key is published to a queue. A worker translates this upstream primary key into a set of downstream primary keys and publishes those downstream primary keys to a compacted queue.

The compacted queue is read by another worker that "recomputes" each dirty key, one at a time, which involves fetching the latest-and-greatest version of each upstream table.

This last worker is the bottleneck, but it's optimized by per-key caching, so we only fetch the latest-and-greatest version once per update. It can also be safely and arbitrarily parallelized, since the stream it reads from is partitioned by key.
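
A rough in-memory sketch of that pipeline, in case it helps make the shape concrete. Everything here is hypothetical and simplified: `CompactedQueue`, `fan_out`, `recompute_all`, and `partition_for` are illustrative names, the queues are plain Python containers rather than a real message broker, and the cache is assumed to live only for one drain of the dirty queue.

```python
import zlib
from collections import OrderedDict, deque


class CompactedQueue:
    """FIFO of dirty keys that holds each pending key at most once."""

    def __init__(self):
        self._keys = OrderedDict()

    def publish(self, key):
        # Publishing a key that is already pending is a no-op (compaction).
        self._keys.setdefault(key, None)

    def pop(self):
        key, _ = self._keys.popitem(last=False)
        return key

    def __len__(self):
        return len(self._keys)


def fan_out(upstream_changes, dependents, dirty):
    """Translate upstream (table, pk) changes into downstream dirty keys."""
    while upstream_changes:
        change = upstream_changes.popleft()
        for downstream_key in dependents.get(change, ()):
            dirty.publish(downstream_key)


def recompute_all(dirty, inputs_for, fetch, compute):
    """Recompute each dirty key once, fetching each upstream row at most once."""
    cache = {}  # assumption: cache is scoped to this single drain
    results = {}
    while dirty:
        key = dirty.pop()
        rows = []
        for ref in inputs_for(key):
            if ref not in cache:
                cache[ref] = fetch(ref)
            rows.append(cache[ref])
        results[key] = compute(key, rows)
    return results


def partition_for(key, num_partitions):
    """Stable key-to-partition mapping, so one worker owns each key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

With two upstream rows feeding two downstream keys, a duplicate change notification collapses in the compacted queue, and each upstream row is fetched once even though both downstream keys need it:

```python
store = {("users", 1): 10, ("orders", 7): 5}
dependents = {
    ("users", 1): ["summary:1"],
    ("orders", 7): ["summary:1", "summary:2"],
}
inputs = {
    "summary:1": [("users", 1), ("orders", 7)],
    "summary:2": [("orders", 7)],
}

dirty = CompactedQueue()
fan_out(deque([("users", 1), ("orders", 7), ("users", 1)]), dependents, dirty)
results = recompute_all(dirty, inputs.__getitem__, store.__getitem__,
                        lambda key, rows: sum(rows))
```

In a real deployment each recompute worker would read only the partition that `partition_for` assigns to it, which is what makes adding workers safe.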

2 comments

> Does Materialize describe how they implement timely dataflow?

It's open source (https://github.com/TimelyDataflow/timely-dataflow), and it has been written about extensively, both in academic research papers and in the project's own documentation. The GitHub repo has pointers to all of that. See also differential dataflow (https://github.com/timelydataflow/differential-dataflow).

Here's a 15-minute introduction to Timely Dataflow by Frank, our co-founder: https://www.youtube.com/watch?v=yOnPmVf4YWo