| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mjr00 570 days ago

> rust-query manages migrations and reads the schema from the database to check that it matches what was defined in the application. If at any point the database schema doesn't match the expected schema, then rust-query will panic with an error message explaining the difference (currently this error is not very pretty).

IMO - this sounds like "tell me you've never operated a real production system before without telling me you've never operated a real production system before."

Shit happens in real life. Even if you have a great deployment pipeline, at some point, you'll need to add a missing index in production fast because a wave of users came in and revealed a shit query. Or your on-call DBA will need to modify a table over the weekend from i32 -> i64 because you ran out of primary key values, and you can't spend the time updating all your code. (in Rust this is dicier, of course, but with something like Python shouldn't cause issues in general.) Or you'll just need to run some operation out of band -- that is, not relying on a migration -- because it what makes sense. Great example is using something like pt-osc[0] to create a temporary table copy and add temporary triggers to an existing table in order to do a zero-downtime copy.

Or maybe you just need to drop and recreate an index because it got corrupted. Shit happens!

Anyway, I really wouldn't recommend a design that relies on your database always agreeing with your codebase 100% of the time. What you should strive for is your codebase being compatible with the database 100% of the time -- that means new columns get added with a default value (or NULL) so inserts work, you don't drop or rename columns or tables without a strict deprecation process (i.e. a rename is really add in db -> add writes to code -> backfill values in db -> remove from code -> remove from db), etc...

But fundamentally panicking because a table has an extra column is crazy. How else would you add a column to a running production system?

[0] https://docs.percona.com/percona-toolkit/pt-online-schema-ch...

4 comments

kelnos 570 days ago

It's a bummer that you've been downvoted, because it really does seem like people here have not operated databases at scale.

I will never claim that we were great at managing databases at Twilio, but often a schema change would take hours, days, or even a week or two to complete. We're taking about tables with hundreds of millions of rows, or more.

We'd start the change on a DB replica. When it would finish, we would have to wait for the replica to catch up with the primary. Then we would bring up new replicas, replicating from the replica with the new schema. Finally that replica would get promoted to primary, with all the old replicas (and the old primary, of course) removed from service, and the new replicas brought in.

Only then could we deploy code that was aware of and used the updated schema. The previous code of course had to ignore unknown columns, and if we ever wanted to drop a column, we had to first deploy code that would stop using that column. Any column type changes would need to be backwards-compatible. If that wasn't possible, we'd have to add a new column and backfill it. Adding indexes would usually be fine without preparatory code changes, but if we wanted to drop an index we'd first have to make sure there were no queries still depending on it.

Even for a "small" schema change that "only" took minutes or a few tens of seconds to complete, we'd still have to use this process. What, do you think we'd shut part or all of a real-time communications platform down while we do a schema change? Of course not.

The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.