by nikita2206 1114 days ago
Very interesting read. Having done similar migrations myself, it’s nice to see that bigger players make the same choices about how to carry out this kind of migration.

I was surprised to see that they had to cancel the ~10 queries that were in flight at the moment they needed to switch over the query traffic. When doing this with ProxySQL, there was an option to pause all connections so that they can’t start new transactions or queries, without cancelling the ones already running, then wait for all ongoing transactions and queries to finish, and only then do the switch and unpause.
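To make the pause / drain / switch / unpause sequence concrete, here is a minimal in-process sketch in Python. It is not ProxySQL's API or implementation: `CutoverGate`, `run`, and `switch` are hypothetical names, and the "backend" is just an opaque value handed to each query. It assumes a single caller performs the switch.

```python
import threading


class CutoverGate:
    """Hypothetical gate that admits queries and can pause, drain, and switch."""

    def __init__(self, backend):
        self._backend = backend
        self._cond = threading.Condition()
        self._paused = False
        self._in_flight = 0

    def run(self, query_fn):
        # New work blocks while paused; work already running is never interrupted.
        with self._cond:
            while self._paused:
                self._cond.wait()
            self._in_flight += 1
            backend = self._backend
        try:
            return query_fn(backend)
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def switch(self, new_backend, drain_timeout=None):
        """Pause admission, wait for in-flight work to finish, swap, unpause.

        Returns True if the backend was switched, False if the drain timed
        out (in which case the old backend stays in place).
        """
        with self._cond:
            self._paused = True
            drained = self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=drain_timeout
            )
            if drained:
                self._backend = new_backend
            self._paused = False
            self._cond.notify_all()
            return drained
```

With `drain_timeout=None` the switch waits for every running query, which is the behaviour described above. Passing a timeout turns it into a choice: give up on the cutover, or (outside this sketch) cancel whatever is still running.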

2 comments

I've been in situations like this where the cost of killing active queries was lower than the cost of pausing traffic (and having it potentially back up or time out) for the extra time it would take for those queries to finish.

Just because you can wait for them to finish, doesn't mean it's better to when you look at the cutover as a whole.
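A rough back-of-the-envelope model of that comparison, as a Python sketch. The function names and the example numbers (arrival rate, drain time, client timeout) are made up for illustration; only the ~10 in-flight queries comes from the thread. It also assumes paused requests are served instantly once traffic resumes, which flatters the pause option.

```python
def failed_if_killed(in_flight):
    # Killing at cutover fails exactly the queries that were running.
    return in_flight


def failed_if_paused(arrival_rate_per_s, drain_seconds, client_timeout_s):
    # A request arriving t seconds into the pause waits (drain_seconds - t).
    # It times out if that wait exceeds the client timeout, i.e. if it
    # arrived during the first (drain_seconds - client_timeout_s) seconds.
    window = max(0.0, drain_seconds - client_timeout_s)
    return arrival_rate_per_s * window


# Hypothetical workload: 1000 requests/s, 5 s to drain, 2 s client timeout.
killed = failed_if_killed(10)
paused = failed_if_paused(1000, 5, 2)
print(killed, paused)
```

Under these assumed numbers, waiting fails far more requests than killing; with a drain time shorter than the client timeout, waiting fails none. Which side wins depends entirely on how long the slowest in-flight query takes.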

Also, if you asked me to pick my poison between things being partially available / degraded for a long period of time and a blip of full unavailability during a cutover, I'd pick the latter 9 times out of 10. I find people are pretty good at writing code to deal with "does it work y/n", but often a lot worse at handling "does it nominally work but is going so slow it will never complete / other things will time out in unexpected orders before this finishes / etc". Some of the worst incidents I've seen were "partial" outages that dragged on for a long time until the right thing could be drained/kicked/whatever.
This took me a long time to accept in my career, but I do believe you've summarized this in a way that rings true for me as well.
Take that one big hit vs death by a thousand cuts!
Ah indeed, this is a trade-off.
I was surprised that they went for database partitioning first.

Caching and optimization weren't mentioned at all, but I guess they had already maxed out that path.