by simonw
1525 days ago
The best trick I know of for zero-downtime upgrades is to have a read-only mode.

Sure, that's not the same thing as pure zero-downtime, but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well-selected time of day.

While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.

I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.

A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.

Depends how much you can get done in 5 seconds though!
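For illustration, the "pause" idea can be sketched in-process with a `threading.Event`. This is only a rough sketch, not how any particular load balancer does it: `PauseGate`, `max_wait_seconds`, and the 503 fallback are all hypothetical names and choices made up for the example.

```python
import threading


class PauseGate:
    """Holds incoming requests while an upgrade runs, then releases them.

    Hypothetical helper: a real load balancer would do this at the
    connection level rather than inside the application process.
    """

    def __init__(self, max_wait_seconds=5.0):
        self._open = threading.Event()
        self._open.set()  # gate starts open: requests pass straight through
        self.max_wait_seconds = max_wait_seconds

    def pause(self):
        # New requests will now block in handle() until resume() is called.
        self._open.clear()

    def resume(self):
        # Release every request currently being held.
        self._open.set()

    def handle(self, request_handler, *args):
        # Wait for the gate to reopen, but never longer than max_wait_seconds.
        if not self._open.wait(timeout=self.max_wait_seconds):
            # Assumption: if the upgrade overruns, fail the request with a 503.
            return 503, "upgrade still in progress"
        return 200, request_handler(*args)
```

The caller would invoke `pause()`, run the upgrade, then `resume()`. Requests arriving in between just look slow to the client, as long as the upgrade fits inside the wait budget.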
|
I've done something similar, although it wasn't about upgrading the database. We needed to migrate data not only between different DB instances, but also between completely different data models (as part of a refactoring). We had several options, such as proper replication plus schema migration in the target DB, or making the app itself write to both models at the same time (which would require a multi-stage release). It all sounded overly complex and error-prone to me, because of all the asynchronous code and queues running in parallel. I should also mention that our DB is sharded per tenant (i.e. per organization).

What I came up with was much simpler: I wrote a script that marked a shard read-only (for this feature), transformed and copied the data via a simple HTTP interface, marked the shard read-write again, and proceeded to the next shard. All other shards stayed read-write the whole time. Since the migration window only affected a single shard at any given moment, no one noticed anything: for a tenant, it translated to 1-2 seconds of not being able to save. In case of problems it would also be easier to revert a few shards than the entire database.
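The per-shard loop described above might look roughly like the sketch below. It is not the actual script: `migrate_shards`, `set_read_only`, and `copy_shard` are hypothetical names, and the two callbacks stand in for whatever flips the read-only flag and does the transform-and-copy over HTTP. Stopping at the first failure is also an assumption, chosen so that only the shards already migrated would need reverting.

```python
def migrate_shards(shard_ids, set_read_only, copy_shard):
    """Migrate one shard at a time.

    Returns (migrated_shard_ids, failed_shard_id_or_None).
    set_read_only(shard_id, flag) and copy_shard(shard_id) are
    caller-supplied callbacks (hypothetical names).
    """
    migrated = []
    for shard_id in shard_ids:
        # Only this shard stops accepting writes; all others stay read-write.
        set_read_only(shard_id, True)
        try:
            copy_shard(shard_id)  # transform and copy this shard's data
        except Exception:
            # Assumption: stop at the first failure and report which shard.
            return migrated, shard_id
        finally:
            # Always reopen the shard for writes, even if the copy failed.
            set_read_only(shard_id, False)
        migrated.append(shard_id)
    return migrated, None
```

Because the `finally` clause runs on both success and failure, a shard is never left stuck in read-only mode, and the returned list says exactly which shards would have to be rolled back.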