HN Mirror


	by haggy 2026 days ago
	Can you point me at documentation for the fault tolerance of the system? A huge issue for streaming systems (and largely unsolved AFAIK) is being able to guarantee that counts aren't duplicated when things fail. How does Materialize handle the relevant failure scenarios in order to prevent inaccurate counts/sums/etc?

2 comments

frankmcsherry 2026 days ago

Hi! I work at Materialize.

I think the right starter take is that Materialize is a deterministic compute engine, one that relies on other infrastructure to act as the source of truth for your data. It can pull data out of your RDBMS's binlog, out of Debezium events you've put into Kafka, out of local files, etc.

On failure and restart, Materialize leans on the ability to return to the assumed source of truth, again an RDBMS + CDC or perhaps Kafka. I don't recommend thinking about Materialize as a place to sink your streaming events at the moment (there is movement in that direction, because the operational overhead of things like Kafka is real).

The main difference is that unlike an OLTP system, Materialize doesn't have to make and persist non-deterministic choices about e.g. which transactions commit and which do not. That makes fault-tolerance a performance feature rather than a correctness feature, at which point there are a few other options as well (e.g. active-active).
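To illustrate the replay idea in the paragraphs above, here is a minimal sketch. It is not Materialize's implementation; the `materialize` function and the `(key, diff)` changelog format are assumptions made up for this example. The point is only that when the view is a pure function of the source log, recomputing from the source of truth after a restart gives the same counts, with nothing duplicated.

```python
from collections import Counter

def materialize(changes):
    """Fold a changelog of (key, diff) pairs into a count per key.

    Hypothetical helper: the result depends only on the input log,
    so replaying the same log always yields the same view.
    """
    view = Counter()
    for key, diff in changes:
        view[key] += diff
        if view[key] == 0:
            del view[key]  # drop keys whose count has returned to zero
    return dict(view)

# Stand-in for the upstream source of truth (e.g. a CDC log or Kafka topic).
source_log = [("apple", +1), ("pear", +1), ("apple", +1), ("pear", -1)]

before_crash = materialize(source_log)
# Simulated restart: all in-memory state is lost, so rebuild from the source.
after_restart = materialize(source_log)
```

Both `before_crash` and `after_restart` are `{"apple": 2}`. Persisting intermediate state would only make the restart faster; it would not change the answer.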

Hope this helps!

jgraettinger1 2026 days ago

This is a solved problem, and has been for a few years now. The basic trick is to publish "pending" messages to the broker which are ACK'd by a later-written message, only after the transaction and all its effects have been committed to stable storage (somewhere). Meanwhile, you also capture consumption state (e.g. offsets) into the same database and transaction within which you're updating the materialization results of a streaming computation.
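A rough sketch of the second half of that trick, committing consumption offsets in the same transaction as the materialized results, might look like the following. It assumes SQLite as a stand-in for the "stable storage", and the table and function names (`counts`, `offsets`, `open_store`, `apply_batch`, `read_counts`) are invented for illustration. It does not cover the "pending"/ACK messages published back to the broker, and it is not how Kafka, Gazette, or Flow implement this.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the (hypothetical) results and offsets tables in one database."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS counts "
               "(name TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS offsets "
               "(part INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
    db.commit()
    return db

def apply_batch(db, part, batch):
    """Apply (offset, name) messages from one partition exactly once.

    Counts and the consumed offset commit together or not at all, so a
    redelivered message is recognized by its offset and skipped.
    """
    with db:  # commits on success, rolls back if an exception escapes
        row = db.execute("SELECT next_offset FROM offsets WHERE part = ?",
                         (part,)).fetchone()
        next_offset = row[0] if row else 0
        for offset, name in batch:
            if offset < next_offset:
                continue  # already applied before a crash or retry
            db.execute("INSERT OR IGNORE INTO counts (name, n) VALUES (?, 0)",
                       (name,))
            db.execute("UPDATE counts SET n = n + 1 WHERE name = ?", (name,))
            next_offset = offset + 1
        db.execute("INSERT OR REPLACE INTO offsets (part, next_offset) "
                   "VALUES (?, ?)", (part, next_offset))

def read_counts(db):
    return dict(db.execute("SELECT name, n FROM counts ORDER BY name"))

db = open_store()
batch = [(0, "a"), (1, "b"), (2, "a")]
apply_batch(db, 0, batch)
apply_batch(db, 0, batch)   # simulated redelivery after a failure
print(read_counts(db))      # {'a': 2, 'b': 1}
```

Because the offset row lives in the same database as the counts, there is no window in which one is updated and the other is not, which is what prevents double counting on retry.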

Here's [1] a nice blog post from the Kafka folks on how they approached it.

Gazette [2] (I'm the primary architect) also solves it with some different trade-offs: a "thicker" client, but with no head-of-line blocking and reduced end-to-end latency.

Estuary Flow [3], built on Gazette, leverages this to provide exactly-once, incremental map/reduce and materializations into arbitrary databases.

[1]: https://www.confluent.io/blog/exactly-once-semantics-are-pos...

[2]: https://gazette.readthedocs.io/en/latest/architecture-exactl...

[3]: https://estuary.readthedocs.io/en/latest/README.html

haggy 2026 days ago

Interesting! I'm going to read into the info you linked. Thanks for the info!
