| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by benesch 1355 days ago

The biggest problem we've encountered with existing tools in the Kafka ecosystem (and the homegrown solutions that we've seen) is that nearly all of them sacrifice consistency. Debezium and most other Kafka Connect plugins will produce duplicate records upon restart, for example, that are very difficult to correctly deduplicate downstream. Things look right when you first turn on the plugin, but a week later when your Kafka Connect cluster restarts, a bit of incorrectness seeps in.

Materialize, by contrast, has been explicitly designed to preserve the consistency present in your upstream system. Our PostgreSQL source, for example, ensures that transactions committed to PostgreSQL appear atomically in Materialize, even when those transactions span multiple tables. See our "consistency guarantees" docs for some more information [0]. We have some additional features coming down the pipe, too, like allowing you to guarantee that your queries against Materialize reflect the latest data in your upstream sources [1].

[0]: https://materialize.com/docs/unstable/overview/isolation-lev...

[1]: https://github.com/MaterializeInc/materialize/issues/11531

1 comments

mozinator 1355 days ago

You can achieve consistency using a transactional outbox and "homegrown" solutions the following way.

Make sure postgresql is configured with `synchronous_commit = remote_apply`

* Create a postgresql logical replication slot which creates a postgresql snapshot in time.

* Start a repeatable read transaction with the snapshot id

* Store all relevant data from the snapshot in sqlite / kv store

* Start listening for WAL changes ( json or protobufs )

* Receive WAL change, mark to postgresql the "write" position of the slot

* Process the data and query all relevant data for materialization from sqlite/kv

* Send data to elasticsearch

* Mark to postgresql the "flush" and "apply" position of the slot

This way you achieve consistency using "homegrown" or Kafka connect possibly too.

link

revertts 1355 days ago

Failures while communicating to the external systems (the kv store and elastic in your example) are usually where this falls down. It's easy to build a system that's consistent ~90% of the time, but if you want to build a system where things like failures during snapshot write or failures during export to elastic are handled properly it starts getting complex (you will need to find ways to recover and retract data, or build smarts into the consumer to query around aborts, or find a way to do a 2PC-esque dance with the external system a la Kafka's transaction support, etc.). Getting to full consistency isn't easy.

link

mozinator 1355 days ago

This has been my experience too. Instead of going the logical replication route I tend to leverage the transactional outbox to achieve consistency in the application layer instead.

So when I transact data into tables I immediately fetch the latest outbox id.

And then when query from Elasticsearch I first fetch what the last outbox id of the processed data is.

This way I know if the transaction was already processed into Elasticsearch or not. Repeat. Until outbox id of Elasticsearch is equal or higher than the outbox id of the mutation.

This way I don't have to use logical replication, no k/v store and I can just use a script that fetches and processes the latest outbox changes on a loop.

link

mozinator 1355 days ago

Looked in the source of Materialize and it looks like this is exactly what they are doing.

They are using Debezium + Kafka for receiving the WAL changes. And using send the processed WAL offsets back using a Kafka topic to Debezium + Postges. This way they can achieve consistency

link