Those MySQL utilities definitely have a lot of interesting stuff in them and will be good for further research, so thanks. After digging up the xtrabackup command used in there I don't see anything obvious that we were missing (we no longer run xtrabackup since we're all RDS now), but I'll continue to research. It actually looks like some instances may have been as simple as using the wrong slave coords file: apparently it's xtrabackup_slave_info that contains the master coords, while xtrabackup_binlog_info records where the slave is in its own binlog. That's a bit counterintuitive.
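To make the distinction concrete, here's a rough Python sketch of picking coordinates from the two files. It assumes the usual non-GTID layouts: xtrabackup_binlog_info as a whitespace-separated `file position` line, and xtrabackup_slave_info as a `CHANGE MASTER TO MASTER_LOG_FILE='...', MASTER_LOG_POS=...` statement. The function names and the `source` argument are made up for illustration, and a GTID setup would need different handling.

```python
import re

# Assumed layout of xtrabackup_slave_info (non-GTID):
#   CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000017', MASTER_LOG_POS=98765
_SLAVE_INFO_RE = re.compile(
    r"MASTER_LOG_FILE\s*=\s*'([^']+)'\s*,\s*MASTER_LOG_POS\s*=\s*(\d+)",
    re.IGNORECASE,
)


def parse_binlog_info(text):
    """Position of the backed-up server in ITS OWN binlog.

    Assumed layout: "<binlog file> <position>" separated by whitespace.
    """
    fields = text.split()
    if len(fields) < 2:
        raise ValueError("unexpected xtrabackup_binlog_info contents")
    return fields[0], int(fields[1])


def parse_slave_info(text):
    """Position of the backed-up slave in its MASTER's binlog."""
    match = _SLAVE_INFO_RE.search(text)
    if match is None:
        raise ValueError("unexpected xtrabackup_slave_info contents")
    return match.group(1), int(match.group(2))


def pick_coords(slave_info_text, binlog_info_text, source):
    """Hypothetical helper: choose coords for a new replica.

    source="master": the new replica attaches to the original master,
        so it needs the master coords from xtrabackup_slave_info.
    source="backed_up_server": the new replica attaches to the server the
        backup was taken from, so it needs xtrabackup_binlog_info.
    """
    if source == "master":
        return parse_slave_info(slave_info_text)
    if source == "backed_up_server":
        return parse_binlog_info(binlog_info_text)
    raise ValueError("source must be 'master' or 'backed_up_server'")
```

Mixing the two up doesn't fail loudly. The new replica just starts from a position that belongs to a different server's binlog, which matches the kind of silent drift we saw.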

We've had other problems with innobackupex, though. We worked with Percona support on a case where a backup lock blocked writes to the DB for 40 minutes, and it was never really resolved. We had to use minimalist locking parameters in several other cases, which may have also contributed to incorrect binlog coordinates. We experienced a variety of other bugs and issues as well, including corrupted database files, and normally had to invoke our Percona Gold contract to get workarounds or patches.

I'll just note that this class of errors doesn't seem possible in any non-MySQL replication system; there is no slave_skip_errors setting in PgSQL. Your slave either has integrity or it doesn't. That's the way slaves should be. A sane database system won't allow a user to write to a slave or to skip replication rows. I'll also note that the band-aids that make MySQL semi-usable are only there because of Percona's efforts. This stuff doesn't make MySQL seem promising, even if there are workarounds for some of the problems.

Uber's problems as described in that video had nothing to do with PgSQL replication. The carnage was caused by running out of disk space. MySQL doesn't behave well when it gets to 0 free space either; I know from experience. It's as much AWS's fault as PgSQL's, because the reason their disk filled up was a change to IAM requirements. He briefly mentions an attempt to hack Pg replication so it would try to resync from a file with a corrupted header, but it's probably good for him that it didn't work.

Can't speak to the complications associated with vacuum, as I've never had to deal with super-large PgSQL databases. And pgbouncer is indeed annoying.