by pwmtr 691 days ago
Yes, we use physical replication for HA.

There are many reasons that cloud providers don't want to support logical replication:

- It requires giving superuser access to the user. Many cloud providers don't want to grant that level of privilege. Some fork PostgreSQL or write custom extensions so that users can manage replication slots without superuser access. However, doing that securely is very difficult: you open up a new attack vector for privilege escalation vulnerabilities.

- If a user creates a replication slot but does not consume the changes, it can quickly fill up the disk. I have dealt with many different PostgreSQL failure modes, and I can confidently say that disk-full cases are among the most problematic/annoying ones to recover from.

- It requires careful management of replication slots in case of failover. There are extensions and third-party tools that help with this, though.
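
To make the disk-full point concrete, here is a rough sketch of how you might spot slots that are holding back a lot of WAL. It assumes you run the query on the primary with whatever client you already use and pass the rows in; `slots_over_limit` and the threshold are made-up names for illustration, and the difference between the current LSN and `restart_lsn` is only an approximation of the WAL a slot retains.

```python
# Illustrative query; the columns come from the pg_replication_slots view.
SLOT_QUERY = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slots_over_limit(rows, limit_bytes):
    """Return (slot_name, active, retained_bytes) for slots above the limit.

    rows: iterable of (slot_name, active, restart_lsn, current_lsn) tuples,
    i.e. the shape returned by SLOT_QUERY (hypothetical helper, not a
    PostgreSQL API).
    """
    flagged = []
    for slot_name, active, restart_lsn, current_lsn in rows:
        if restart_lsn is None:
            # The slot has not reserved any WAL, so it retains nothing.
            continue
        retained = lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)
        if retained > limit_bytes:
            flagged.append((slot_name, active, retained))
    return flagged
```

An inactive slot that shows up here with a growing number is the "created but never consumed" case described above.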

So, some cloud providers don't support logical replication and some support it weakly (i.e. don't cover all edge cases).

Thankfully, improvements are being made in PostgreSQL core that simplify failover of logical replication slots (for more information, see https://www.postgresql.org/docs/17/logical-replication-failo...), but it is still too early.

2 comments

There are also cases where logical replication is not 100% complete. One of our applications uses LOBs for some reason and can't use logical replication.
Yeah, I've dealt with some of those edge cases on AWS and GCP.

Some examples:

1. I've seen a delay of hours without any messages being sent on the replication protocol, likely due to a large transaction in the WAL not committed when running out of disk space.

2. `PgError.58P02: could not create file "pg_replslot/<slot_name>/state.tmp": File exists`

3. `replication slot "..." is active for PID ...`, with some system process holding on to the replication slot.

4. `can no longer get changes from replication slot "<slot_name>". ... This slot has never previously reserved WAL, or it has been invalidated`.

All of these require manual intervention on the server to resolve.
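
As a rough illustration, a couple of these states (3 and 4) are visible in `pg_replication_slots` before a consumer trips over them. The sketch below assumes PostgreSQL 13 or later (where the `wal_status` column exists) and that each row has already been fetched into a dict; `diagnose_slot` is a hypothetical helper, not something PostgreSQL or a driver provides.

```python
def diagnose_slot(slot: dict) -> str:
    """Summarize the state of one row from pg_replication_slots.

    Expects the keys slot_name, active, active_pid, restart_lsn and
    wal_status; missing keys are treated as NULL.
    """
    if slot.get("wal_status") == "lost":
        # Required WAL has been removed; the slot is no longer usable.
        return "invalidated: required WAL was removed"
    if slot.get("active"):
        # Another session holds the slot; a second consumer will be refused.
        return f"in use by PID {slot.get('active_pid')}"
    if slot.get("restart_lsn") is None:
        return "no WAL reserved"
    return "idle"
```

This only tells you which state a slot is in; actually fixing it still needs the manual intervention mentioned above.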

And that's not even taking HA/failover into account; these are just issues with logical replication on a single node. It's still a great feature though, and often worth having to deal with these issues now and then.

Definitely agreed. It is a great feature and a building block for many complex features such as multiple primaries, zero-downtime migrations, etc. I'm also quite happy to see that it becomes more stable and easier to use with each PG version.