PostgreSQL is one of the most powerful and reliable pieces of software I've seen run at large scale. Major kudos to all the maintainers for the improvements that keep being added.

> PostgreSQL 14 extends its performance gains to the vacuuming system, including optimizations for reducing overhead from B-Trees. This release also adds a vacuum "emergency mode" that is designed to prevent transaction ID wraparound

Dealing with transaction ID wraparound in Postgres was one of the most daunting but fun experiences for me as a young SRE.

Each time a transaction modifies rows in a PG database, it is assigned the next value of the transaction ID counter. This counter is stored as a 32-bit integer and it's critical to the MVCC transaction semantics: changes made by a transaction with a higher ID should not be visible to a transaction with a lower ID. Because IDs are compared modulo 2^32, only about 2 billion of them can count as "in the past" at any one time. If an unfrozen row version gets more than about 2 billion transactions old, the comparison wraps around and disaster strikes, as past transactions now appear to be in the future. If PG detects it is reaching that point, it complains loudly and eventually stops further writes to the database to prevent data loss.

Postgres avoids getting anywhere close to this situation in almost all deployments by performing routine "auto-vacuums", which mark old row versions as "frozen" so they no longer tie up transaction ID space. However, there are a couple of situations where vacuum will not be able to clean up enough row versions. In our case, this was due to long-running transactions that consumed IDs but never finished. It is also possible, but highly inadvisable, to disable auto-vacuums. Here is a postmortem from Sentry, who had to deal with this leading to downtime: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-...

It looks like the new vacuum "emergency mode" functionality starts vacuuming more aggressively when getting closer to the wraparound event and, as with every PG feature, highly granular settings are exposed to tweak this behaviour (https://www.postgresql.org/about/featurematrix/detail/360/)
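To make the wraparound described above concrete, here is a minimal Python sketch of comparing 32-bit transaction IDs modulo 2^32. The function name `xid_precedes` is made up for illustration, and the sketch ignores the handful of special reserved IDs that Postgres treats differently, so take it as a model of the idea rather than the actual implementation.

```python
MASK_32 = 0xFFFFFFFF   # transaction IDs are 32-bit unsigned values
SIGN_BIT = 0x80000000  # 2^31, half of the ID space


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID `a` is logically older than XID `b`.

    The difference is taken modulo 2^32 and then read as a signed
    32-bit number, so each ID sees ~2 billion IDs behind it as "past"
    and ~2 billion ahead of it as "future".
    """
    diff = (a - b) & MASK_32
    if diff >= SIGN_BIT:
        diff -= 1 << 32  # reinterpret as a negative signed 32-bit value
    return diff < 0


if __name__ == "__main__":
    old_row = 100
    # A little under 2^31 transactions later the old row is still "past".
    print(xid_precedes(old_row, old_row + 2**31 - 1))
    # A little over 2^31 transactions later it flips to the "future".
    print(xid_precedes(old_row, old_row + 2**31 + 1))
```

Freezing sidesteps this comparison entirely for old rows, which is why a vacuum that cannot keep up is what eventually forces writes to stop.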
When I was running my first Postgres cluster (the reddit databases), I had no idea what vacuuming was for. All I knew was that every time it ran, it slowed everything down. Being dumb, I didn't bother to read the docs; I just disabled the auto vacuum.
Eventually writes stopped and I had to take downtime to do a vacuum. Learned a few important lessons that day. I also then set it up to do an aggressive vacuum every day at 3am, which was the beginning of low traffic time, so that the auto-vacuuming didn't have as much work to do during the day.
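One way to avoid finding out the hard way is to watch how old the oldest unfrozen transaction ID in each database is. Below is a rough Python sketch, assuming you fetch rows yourself with the query in `XID_AGE_QUERY`. The function names and the status labels are hypothetical, and the two thresholds are the documented defaults for `autovacuum_freeze_max_age` and `vacuum_failsafe_age` (the latter exists from PostgreSQL 14 on); both are configurable, so your cluster may differ.

```python
# Query to run against a live cluster; it is not executed in this sketch.
XID_AGE_QUERY = (
    "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;"
)

# Assumed values: documented defaults, check your own configuration.
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000
VACUUM_FAILSAFE_AGE = 1_600_000_000  # PostgreSQL 14+


def classify_xid_age(age: int) -> str:
    """Map a database's XID age to a rough, made-up status label."""
    if age >= VACUUM_FAILSAFE_AGE:
        return "failsafe"           # roughly where "emergency mode" applies
    if age >= AUTOVACUUM_FREEZE_MAX_AGE:
        return "forced-autovacuum"  # anti-wraparound autovacuum territory
    return "ok"


def report(rows):
    """Turn (datname, age) rows from XID_AGE_QUERY into a status dict."""
    return {datname: classify_xid_age(age) for datname, age in rows}


if __name__ == "__main__":
    # Example rows with invented database names and ages.
    sample = [("app", 1_700_000_000), ("logs", 250_000_000), ("tmp", 12_345)]
    for name, status in report(sample).items():
        print(name, status)
```

The thresholds apply per table in Postgres itself, so a database-level age is only a coarse signal, but it is usually enough to tell you that something is stuck well before writes stop.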