Hacker News new | ask | show | jobs
by aeyes 925 days ago
There is a better way than fully copying table content one by one which is very I/O heavy and will not work if you have very large tables.

You can create a replication slot, take a snapshot, restore the snapshot to a new instance, advance the LSN and replicate from there - boom, you have a logical replica with all the data. Then you upgrade your logical replica.

This article from Instacart shows how to do it: https://archive.ph/K5ZuJ

If I remember correctly the article has some small errors but I haven't done this in a while and I don't exactly remember what was wrong. But in general the process works, I have done it like this several times upgrading TB-sized instances.

5 comments

> You can create a replication slot, take a snapshot, restore the snapshot to a new instance, advance the LSN and replicate from there - boom, you have a logical replica with all the data. Then you upgrade your logical replica.

This is a great recipe but needs small but important correction. We need to be careful with plugging pg_upgrade in this physical-to-logical replica conversion process: if we first start logical replication and then running pg_upgrade, there are risks of corruption – see discussion in pgsql-hackers https://www.postgresql.org/message-id/flat/20230217075433.u5.... To solve this, we need first to create logical slot, advance the new cluster to slot's LSN position (not starting logical replication yet), then run pg_upgrade, and only then logical replication – when the new cluster is already running on new PG version.

This is exactly how we (Postgres.ai) recently have helped GitLab upgrade multiple multi-TiB clusters under heavy load without any downtime at all (also involving PgBouncer's PAUSE/RESUME) - there will be a talk by Alexander Sosna presented later this week https://www.postgresql.eu/events/pgconfeu2023/schedule/sessi... and there are some plans to publish details about it.

Thank you for linking this insightful discussion.

I am not sure why I never ran into this problem, unfortunately I don't have access to my notes anymore because I no longer work on this.

This approach has solved so many problems for me. I can do full vacuum, I can change integer columns to bigint, I can do major version upgrades, I can even move instances across AWS regions all with minimal downtime.

It's really great to see that people continue to tinker with it and that there are active discussions on the mailing list to keep improving logical replication. It's come a long way since the first implementation. Thanks for your contribution!

OP here: we looked at this and were not confident in manually advancing the LSN as proposed, and detecting any inconsistency if we missed any replication as a result. Table by table seemed more reliable, despite being more painstaking.
As long as you have the correct LSN there is no way for this to go wrong.

If you resume replication with an incorrect LSN replication will break immediately. I have spent way too much time trying to do this on my own before the blog post was written and I have seen it fail over and over again.

To give you more confidence, try with the LSN from the "redo starts at" log message. It looks close but it will always fail.

Sadly this isn't true. Postgres will happily replicate and skip data if you tell it too.

And there have been multiple bugs around logical replication in version ~10-15 that can cause data loss. None of these are directly related to lsn fiddling tho.

Indeed - People In The Know have some cocnerns with this approach: https://ardentperf.com/2021/07/26/postgresql-logical-replica... .

At $work we did use this approach to upgrade a large, high throughput PG database, but to mitigate the risk we did a full checksum of the tables. This worked something like:

    * Set up logical replica, via 'instacart' approach
    * Attach physical replicas to the primary instance and the logical replica, wait for catchup
    * (very) briefly pause writes on the primary, and confirm catchup on the physical replicas
    * pause log replay on the physical replicas
    * resume writes on the primary
    * checksum the data in each physical replica, and compare
This approach required <1s write downtime on the primary for a very comprehensive data validation.
That article covers the basis of how we do upgrades at Instacart, but is quite old. This is a more modern look at how we accomplish the process. We have used this process to upgrade a lot of very large and very active databases successfully.

https://www.instacart.com/company/how-its-made/zero-downtime...

You caught our off by one bug :)