by nja
1658 days ago
Debezium is a useful tool, but requires a lot of babysitting. If the DB connection blips or DNS changes (say, if you just rebuilt your prod db), or in some other cases, it'll die and present this exact problem. Fortunately, it's easy to enable a "heartbeat" topic to alert on to make sure it can be restarted before the db disk fills (of course, db size growth alerts are critical too). We've found that it's worth it for most use cases to switch to a vanilla JDBC Kafka Connector with frequent polling. This also allows for cases such as emitting joined data. Other than Debezium, Postgres + Kafka + Kafka Connect builds a pretty stable system for sending data around all our different dbs, apps, and data lakes.
Agreed though that monitoring should be in place so that you're notified of failed connectors early on (this could be based on the heartbeat topic, but there are also JMX metrics which can be exposed to Prometheus/Grafana, and/or health checks could be set up based on the connector's status as exposed via the Kafka Connect REST API).
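As a rough illustration of the REST-based health check, here is a minimal Python sketch. It relies on the `GET /connectors/<name>/status` endpoint of the Kafka Connect REST API; the base URL, the connector name, and the helper names `fetch_status` and `not_running` are made up for the example, and it simply treats anything other than `RUNNING` (including `PAUSED`) as worth a look.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_status(base_url, connector):
    """Fetch the status document for one connector from Kafka Connect.

    base_url is an assumption, e.g. "http://localhost:8083".
    """
    url = f"{base_url}/connectors/{quote(connector)}/status"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def not_running(status):
    """Return labels for the connector and any tasks that are not RUNNING."""
    problems = []
    conn_state = status.get("connector", {}).get("state")
    if conn_state != "RUNNING":
        problems.append(f"connector:{conn_state}")
    for task in status.get("tasks", []):
        if task.get("state") != "RUNNING":
            problems.append(f"task-{task.get('id')}:{task.get('state')}")
    return problems
```

A cron job or monitoring probe could call `not_running(fetch_status(...))` every minute or so and raise an alert whenever the returned list is non-empty.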
On the matter of disk growth, there's no silver bullet here. Some people will want to make 100% sure that events are never missed, which implies the replication slot must hold onto those WAL segments for as long as they haven't been read (this is not specific to Debezium, btw). Others may be willing to accept missing events if the slot isn't read for long enough, so that those WAL segments can be discarded. In recent Postgres versions, a max size (or age, not sure) can be configured for a replication slot, so it's a matter of configuration which behavior you want.
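For the size-based option, the setting I'm aware of is `max_slot_wal_keep_size` (Postgres 13 and later). Independently of that, the WAL retained by each slot can be watched directly. The sketch below is only an illustration: the query uses the standard `pg_replication_slots` view with `pg_wal_lsn_diff` and `pg_current_wal_lsn`, while `slots_over_limit`, the row layout, and the threshold are assumptions for the example. Running the query is left to whatever Postgres driver you already use.

```python
# Query to run against the primary; returns one row per replication slot.
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def slots_over_limit(rows, limit_bytes):
    """Return the names of slots retaining more WAL than limit_bytes.

    rows: iterable of (slot_name, active, retained_bytes) tuples, as
    produced by SLOT_QUERY. retained_bytes may be None when the slot
    has no restart_lsn yet; such slots are skipped.
    """
    return [
        name
        for name, _active, retained in rows
        if retained is not None and retained > limit_bytes
    ]
```

Alerting on this number, rather than only on overall disk usage, points straight at the slot that has stopped being consumed.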
In any case, a connector downtime of more than, say, a few hours is something that should show up as an alert, so that action can be taken.
Disclaimer: I work on Debezium