Hacker News new | ask | show | jobs
by brentjanderson 924 days ago
OP here:

> Can't initialize a replica from a backup

You could, but you're not going to get any of the constant writes happening during the backup. You will have missing writes on the restored system without some kind of replication involved unless you move up to the application layer.

For example, you could update your app to apply dual writes. I'm aware of teams that have replatformed entire applications on to completely different DBs that way (e.g. going from an RDBMS to something completely different like Apache Cassandra).

For our situation, dual-writes seemed more risky than just doing the dirty work of setting up streaming replication using out of the box Postgres features. But, for some teams it could be a better move.

> This isn't "zero downtime"

and

> The article omits details on how consistency was preserved

In the post we go into detail about how we preserved consistency & avoided API downtime, but the gist is that the app was connected to both databases, but not using the new one by default. We then sent a signal to all instances of our app to cut over using Launch Darkly, which maintains a low-latency connection to all instances of our app.

For the first second after that signal, the servers queued up database requests to allow for replication to catch up. This caused a brief spike in latency that was within intentionally calculated tolerances. After that pause, requests flowed as usual but against the new database and the cut over was complete.

We included a force-disconnect against any pending traffic against the old database as well, with a 500 ms timeout. This timeout was much higher than our p99 query times, so no running queries were force terminated. This ensured that the old database's traffic had ceased, and gave replication plenty of time to catch up.

> No mention of a rollback option

Although it didn't make the cut for the blog post, we considered setting up a fallback database on PG 11.9 and replicating the 15.3 database into that third database. If we needed to abort, we could roll forward to this database on the same version.

We opted to not do this after practicing our upgrade procedure multiple times in staging to ensure we could do this successfully. Having practiced the procedure multiple times gave us confidence when it came to performing the cut over. We also used canary deployments in production to verify certain read-only workloads against the database, treating the 15.3 instance as a read replica.

To your point about it being late at night, we intentionally did this in the early evening on a weekend to avoid "fat finger" type mistakes. The cut over was carefully scripted and rehearsed to reduce the risk of human error as well.

In the event that we needed to rollback, the system was also prepared to flip back to the old database in the event of a catastrophic failure. This would have lead to some data loss against the new database, and we were prepared to reconcile any key pieces of the system in that scenario. To minimize the risk of data loss, we paused certain background tasks in the system briefly during the cutover to reduce the number of writes applied against the system. These details didn't make the blog post as we were going for more of the specifics to Postgres and less to Knock-specific considerations. Teams trying to apply this playbook will always need to build their own inventory of risks and seek to mitigate them in a context-dependent way.

Edit: More detail about rollback procedure