by dboreham 924 days ago
Surprised you can't initialize a replica from a backup. That would have saved all the farting around streaming the old stable DB content to the new server.

Also, this isn't "zero downtime" -- there are a few seconds of downtime while service cuts over to the new server.

The article omits details on how consistency was preserved -- you can't just point your application at both servers for some period of time, for example. Possibly you can serve reads from both (but not really), but writes absolutely have to be directed to only one server. Article doesn't mention this.

Lastly, there was no mention of a rollback option -- in my experience performing this kind of one-off forklift on a large amount of data, things sometimes go off the rails late at night. Therefore you always need a plan for how you can revert to the previous step and go to bed with the assurance that service will still be up in the morning. Specifically, that is hard if you've already sent write transactions to the new server but for some reason need to cut back to the old one: the data is now inconsistent.


OP here:

> Can't initialize a replica from a backup

You could, but the backup won't contain any of the writes that keep arriving while it is being taken and restored. Without some kind of replication, the restored system will be missing those writes unless you move up to the application layer.

For example, you could update your app to apply dual writes. I'm aware of teams that have replatformed entire applications on to completely different DBs that way (e.g. going from an RDBMS to something completely different like Apache Cassandra).

For our situation, dual writes seemed riskier than just doing the dirty work of setting up streaming replication using out-of-the-box Postgres features. But for some teams it could be a better move.
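
For anyone unfamiliar with the pattern, application-level dual writes look roughly like the sketch below. This is not something we ran; `DualWriter` and the dict-like stores are hypothetical stand-ins, and treating the old store as the source of truth while only recording failures on the new one is an assumption about how a team might set it up.

```python
import logging

logger = logging.getLogger("dual_write")


class DualWriter:
    """Hypothetical sketch: apply every write to two dict-like stores."""

    def __init__(self, primary, secondary):
        self.primary = primary      # existing store, treated as source of truth
        self.secondary = secondary  # new store being populated
        self.failed_keys = []       # writes that need reconciliation later

    def write(self, key, value):
        # A failure on the primary propagates to the caller as usual.
        self.primary[key] = value
        try:
            self.secondary[key] = value
        except Exception:
            # Assumption: a secondary failure must not fail the request,
            # so it is logged and remembered for a later backfill.
            logger.exception("secondary write failed for key %r", key)
            self.failed_keys.append(key)
```

The reconciliation list is where the risk lives: every failed or reordered secondary write has to be detected and repaired before the new store can be trusted.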

> This isn't "zero downtime"

and

> The article omits details on how consistency was preserved

In the post we go into detail about how we preserved consistency & avoided API downtime. The gist is that the app was connected to both databases but did not use the new one by default. We then sent a signal to all instances of our app to cut over using LaunchDarkly, which maintains a low-latency connection to all instances of our app.

For the first second after that signal, the servers queued up database requests to allow replication to catch up. This caused a brief spike in latency that was within intentionally calculated tolerances. After that pause, requests flowed as usual but against the new database, and the cutover was complete.

We also force-disconnected any traffic still pending against the old database, with a 500 ms timeout. This timeout was much higher than our p99 query times, so no running queries were force-terminated. This ensured that the old database's traffic had ceased, and gave replication plenty of time to catch up.
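
The pause-then-switch behavior described above can be sketched as follows. This is a simplified illustration and not our production code: `CutoverRouter` is a hypothetical name, the databases are plain callables, and a fixed sleep stands in for "replication has caught up".

```python
import threading
import time


class CutoverRouter:
    """Hypothetical sketch: send queries to the old database until cut over."""

    def __init__(self, old_db, new_db, pause_seconds=1.0):
        self._new_db = new_db
        self._pause_seconds = pause_seconds
        self._target = old_db
        self._gate = threading.Event()  # set means requests may proceed
        self._gate.set()

    def cut_over(self):
        """Hold new requests briefly, then release them against the new database."""
        self._gate.clear()               # incoming requests start queueing
        time.sleep(self._pause_seconds)  # assumption: long enough for replication
        self._target = self._new_db
        self._gate.set()                 # queued requests now run on the new database

    def execute(self, query):
        # Blocks only during the pause window. A query that passed the gate
        # just before cut_over() may still run on the old database, which is
        # the case the force-disconnect timeout covers.
        self._gate.wait()
        return self._target(query)
```

In a real deployment `cut_over()` would be triggered by the feature-flag signal on every app instance, and the fixed sleep would more safely be replaced by a check that replication lag has reached zero.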

> No mention of a rollback option

Although it didn't make the cut for the blog post, we considered setting up a fallback database on PG 11.9 and replicating the 15.3 database into that third database. If we needed to abort, we could roll forward to this database on the same version.

We opted not to do this after practicing our upgrade procedure multiple times in staging, which gave us confidence that we could perform the cutover successfully. We also used canary deployments in production to verify certain read-only workloads against the database, treating the 15.3 instance as a read replica.

To your point about it being late at night, we intentionally did this in the early evening on a weekend to avoid "fat finger" type mistakes. The cutover was carefully scripted and rehearsed to reduce the risk of human error as well.

The system was also prepared to flip back to the old database in the event of a catastrophic failure. This would have led to the loss of some data already written to the new database, and we were prepared to reconcile any key pieces of the system in that scenario. To minimize the risk of data loss, we briefly paused certain background tasks during the cutover to reduce the number of writes applied to the system. These details didn't make the blog post, as we were focusing on the Postgres specifics rather than Knock-specific considerations. Teams trying to apply this playbook will always need to build their own inventory of risks and seek to mitigate them in a context-dependent way.

Edit: More detail about rollback procedure