Hacker News new | ask | show | jobs
by simonw 1525 days ago
The best trick I know of for zero-downtime upgrades is to have a read-only mode.

Sure, that's not the same thing as pure zero-downtime but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.

While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.

I've even worked with a team used this trick to migrate between two data centers without visible end-user downtime.

A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.

Depends how much you can get done in 5 seconds though!

1 comments

>The best trick I know of for zero-downtime upgrades is to have a read-only mode.

I've done something similar, although it wasn't about upgrading the database. We needed to not only migrate data between different DB instances, but also between completely different data models (as part of refactoring). We had several options, such as proper replication + schema migration in the target DB, or by making the app itself write to two models at the same time (which would require a multi-stage release). It all sounded overly complex to me and prone to error, due to a lot of asynchronous code/queues running in parallel. I should also mention that our DB is sharded per tenant (i.e. per an organization). What I came up with was much simpler: I wrote a simple script which simply marked a shard read-only (for this feature), transformed and copied data via a simple HTTP interface, then marked it read-write again, and proceeded to the next shard. All other shards were read-write at a given moment. Since the migration window only affected a single shard at any given moment, no one noticed anything: for a tenant, it translated to 1-2 seconds of not being able to save. In case of problems it would also be easier to revert a few shards than the entire database.

I like this approach.

I'm picturing your migration script looping over shards. It flips it to read-only, migrates, then flips back to read-write.

How did the app handle having some shards in read-write mode pre-migration and other shards in read-write post-migration simultaneously.

Yes, it simply looped over shards, we already had a tool to do that.

The app handled it by proxying calls to the new implementation if the shard was marked as "post-migration", the API stayed the same. If it was "in migration", all write operations returned an error. If the state was "pre-migration", it worked as before.

I don't already remember the details but it was something about the event queue or the notification queue which made me prefer this approach over the others. When a shard was in migration, queue processing was also temporarily halted.

Knowing that a shard is completely "frozen" during migration made it much easier to reason about the whole process.