|
|
|
|
|
by gen220
2028 days ago
|
|
That's a great description! Does materialize describe how they implement timely dataflow? At my current company, we have built some systems like this. Where a downstream table is essentially a function of a dozen upstream tables. Whenever one of the upstream tables changes, it's primary key is published to a queue, some worker translates this upstream primary key into a set of downstream primary keys,
and publishes these downstream primary keys to a compacted queue. The compacted queue is read by another worker, that "recomputes" each dirty key, one-at-a-time, which involves fetching the latest-and-greatest version of each upstream table. This last worker is the bottleneck, but it's optimized by per-key caching, so we only fetch the latest-and-greatest version once per update. It can also be safely and arbitrarily parallelized, since the stream they read from is partitioned on key. |
|
It's open source (https://github.com/TimelyDataflow/timely-dataflow), and also extensively written about both in academic research papers and documentation for the project itself. The GitHub repo has pointers to all of that. See also differential dataflow (https://github.com/timelydataflow/differential-dataflow).