by benesch
1355 days ago
The biggest problem we've encountered with existing tools in the Kafka ecosystem (and the homegrown solutions that we've seen) is that nearly all of them sacrifice consistency. Debezium and most other Kafka Connect plugins, for example, will produce duplicate records upon restart that are very difficult to deduplicate correctly downstream. Things look right when you first turn on the plugin, but a week later, when your Kafka Connect cluster restarts, a bit of incorrectness seeps in.

Materialize, by contrast, has been explicitly designed to preserve the consistency present in your upstream system. Our PostgreSQL source, for example, ensures that transactions committed to PostgreSQL appear atomically in Materialize, even when those transactions span multiple tables. See our "consistency guarantees" docs for some more information [0].

We have some additional features coming down the pipe, too, like allowing you to guarantee that your queries against Materialize reflect the latest data in your upstream sources [1].

[0]: https://materialize.com/docs/unstable/overview/isolation-lev...

[1]: https://github.com/MaterializeInc/materialize/issues/11531
Make sure PostgreSQL is configured with `synchronous_commit = remote_apply`.
* Create a PostgreSQL logical replication slot, which also exports a point-in-time snapshot.
* Start a repeatable read transaction using that snapshot's ID.
* Store all relevant data from the snapshot in SQLite or a KV store.
* Start listening for WAL changes (JSON or protobuf).
* On receiving a WAL change, report the slot's "write" position to PostgreSQL.
* Process the change and query SQLite/KV for all the data needed for materialization.
* Send the data to Elasticsearch.
* Report the slot's "flush" and "apply" positions to PostgreSQL.
This way you can achieve consistency with a "homegrown" solution, and possibly with Kafka Connect too.
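
To make the acknowledgement ordering concrete, here is a toy Python model of the receive / process / send / confirm loop. It is only a sketch under loud assumptions: `ToySlot`, `consume`, and `crash_after` are hypothetical names invented for illustration, there is no real PostgreSQL or Elasticsearch client involved, LSNs are plain integers, and the sink is a dict that is upserted by key. The point it tries to show is that if the "flush" position is reported only after the sink write succeeds, a crash causes a replay rather than a lost change, and a keyed upsert makes that replay harmless.

```python
from dataclasses import dataclass


@dataclass
class ToySlot:
    """Hypothetical stand-in for a logical replication slot (not a real API)."""
    wal: list            # ordered (lsn, key, value) changes
    write_lsn: int = 0   # last change the consumer reported as received
    flush_lsn: int = 0   # last change the consumer reported as durably handled

    def stream(self):
        # On every (re)connect, replay everything past the confirmed flush position.
        return [change for change in self.wal if change[0] > self.flush_lsn]

    def send_feedback(self, write_lsn=None, flush_lsn=None):
        # Reported positions only ever move forward.
        if write_lsn is not None:
            self.write_lsn = max(self.write_lsn, write_lsn)
        if flush_lsn is not None:
            self.flush_lsn = max(self.flush_lsn, flush_lsn)


def consume(slot, local_store, sink, delivered, crash_after=None):
    """Process the stream; optionally simulate a crash after N sink writes."""
    for count, (lsn, key, value) in enumerate(slot.stream(), start=1):
        slot.send_feedback(write_lsn=lsn)   # received, not yet safe downstream
        local_store[key] = value            # refresh the local SQLite/KV copy
        sink[key] = local_store[key]        # idempotent upsert keyed by primary key
        delivered.append(lsn)               # record every delivery, replays included
        if crash_after == count:
            return                          # crash before the flush position is sent
        slot.send_feedback(flush_lsn=lsn)   # confirm only after the sink has the data


wal = [(1, "a", "x"), (2, "b", "y"), (3, "a", "z")]
slot, local_store, sink, delivered = ToySlot(wal), {}, {}, []

consume(slot, local_store, sink, delivered, crash_after=2)  # dies after LSN 2 hits the sink
consume(slot, local_store, sink, delivered)                 # restart: LSN 2 is replayed
```

In this model LSN 2 reaches the sink twice, but because the write is an upsert on the key, the final sink contents match the upstream state. A real consumer would report these positions through the replication protocol's status updates instead of a method on a local object.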