Which version of Debezium did you encounter these issues with? Connection losses should not be a problem with any current version, as the connectors will restart automatically in that case.

Agreed though that monitoring should be in place, so that you are notified about failed connectors early on. This could be based on the heartbeat topic, but there are also JMX metrics which can be exposed to Prometheus/Grafana, and/or health checks could be set up based on the connector's status as exposed via the Kafka Connect REST API.
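To make the REST API option a bit more concrete, here is a minimal sketch of such a health check in Python. It relies on the Kafka Connect endpoint GET /connectors/{name}/status, which returns the connector state plus one state per task. The base URL and connector name you pass in, both function names, and the choice to treat anything other than RUNNING (including PAUSED, or a connector without any tasks) as a problem are my assumptions for illustration, so adjust them to your setup.

```python
import json
import urllib.request


def fetch_status(base_url, connector):
    """Fetch the status document for one connector from Kafka Connect.

    base_url and connector are placeholders, e.g. "http://localhost:8083"
    and "inventory-connector" (both hypothetical).
    """
    url = f"{base_url}/connectors/{connector}/status"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def unhealthy_parts(status):
    """Return a list of human-readable problems; empty means healthy.

    Assumption: only RUNNING counts as healthy, so PAUSED is reported too.
    """
    problems = []
    connector_state = status.get("connector", {}).get("state")
    if connector_state != "RUNNING":
        problems.append(f"connector is {connector_state}")

    tasks = status.get("tasks", [])
    if not tasks:
        # Assumption: a connector without tasks is not streaming anything.
        problems.append("no tasks reported")
    for task in tasks:
        if task.get("state") != "RUNNING":
            problems.append(f"task {task.get('id')} is {task.get('state')}")
    return problems
```

You would call unhealthy_parts(fetch_status(...)) from a cron job or a health-check endpoint and raise an alert whenever the returned list is not empty. Checking the tasks matters because a task can be FAILED while the connector itself still reports RUNNING.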

On the matter of disk growth, there's no silver bullet here. Some people will want to make 100% sure that no events are ever missed, which implies the replication slot must hold onto those WAL segments while it isn't being read (this is not specific to Debezium, btw). Others may be willing to accept missing events if the slot isn't read for long enough, so that those WAL segments can be discarded. In recent Postgres versions (13 and later), the max_slot_wal_keep_size setting caps how much WAL replication slots may retain, so it's a matter of configuration which behavior you want.
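Whichever trade-off you pick, it helps to watch how much WAL a slot is holding back. Below is a rough Python sketch of the arithmetic, assuming you read restart_lsn from the pg_replication_slots view and the current write position from pg_current_wal_lsn() on the primary. An LSN is printed as two hexadecimal numbers separated by a slash (high and low 32 bits of a 64-bit position), so the retained amount is simply the difference between the two positions. The function names and the threshold are made up for illustration, and actually running the query is left to whatever Postgres driver you use.

```python
# Query to run on the primary; the result feeds the helpers below.
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL between the slot's restart_lsn and the current position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slot_needs_alert(current_lsn, restart_lsn, threshold_bytes):
    """True if the slot holds back more WAL than the (hypothetical) threshold."""
    return retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes
```

The same difference can be computed in SQL with pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn); doing it client-side is only convenient if the alerting logic lives in the monitoring script anyway.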

In any case, connector downtime of more than, say, a few hours is something that should show up as an alert, so that you can take action.

Disclaimer: I work on Debezium