by agf 1103 days ago
I've been in situations like this where the cost of killing active queries was lower than the cost of pausing traffic (and having it potentially back up or time out) for the extra time it would take for those queries to finish.

Just because you can wait for them to finish doesn't mean it's better to, when you look at the cutover as a whole.
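A minimal sketch of that trade-off, assuming the active queries can be modeled as Python `asyncio` tasks (a stand-in for whatever the real database exposes): give them a bounded grace period to finish, then cancel whatever is left so the traffic pause has a known upper bound. The names `cutover`, `drain_timeout`, and `fake_query` are hypothetical.

```python
import asyncio


async def cutover(in_flight, drain_timeout):
    """Hypothetical cutover step: bounded drain, then kill stragglers.

    in_flight is assumed to be a collection of asyncio Tasks standing in
    for active queries. Returns (finished, killed) counts.
    """
    if not in_flight:
        return 0, 0  # nothing to drain
    # Give in-flight work at most drain_timeout seconds to finish on its own.
    done, pending = await asyncio.wait(in_flight, timeout=drain_timeout)
    # Cancel ("kill") anything still running so the pause stays bounded.
    for task in pending:
        task.cancel()
    # Wait for the cancellations to land; CancelledError is collected, not raised.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def fake_query(seconds):
    # Simulated query that takes `seconds` to complete.
    await asyncio.sleep(seconds)
    return seconds


async def main():
    tasks = [asyncio.create_task(fake_query(s)) for s in (0.01, 0.02, 5.0)]
    finished, killed = await cutover(tasks, drain_timeout=0.5)
    print(finished, killed)  # prints: 2 1
    return finished, killed


if __name__ == "__main__":
    asyncio.run(main())
```

Picking `drain_timeout` is the judgment call described above: a longer grace period kills fewer queries but keeps traffic paused (and backing up) for longer.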


Also, if you asked me to pick my poison: either things are partially available / degraded for a long period of time, or there's a blip of full unavailability during a cutover, I'd pick the latter 9 times out of 10. I find people are pretty good about writing code to deal with "does it work y/n", but people are often a lot less good about "does it nominally work but is going so slowly it will never complete / other things will time out in unexpected orders before this finishes / etc". Some of the worst incidents I've seen were "partial" outages that spanned a long time period until the right thing could be drained/kicked/whatever.
This took me a long time to accept in my career, but I do believe you've summarized this in a way that rings true for me as well.
Take that one big hit vs death by a thousand cuts!
Ah indeed, this is a trade-off.
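The "does it work y/n" point can be sketched the same way. This is a minimal illustration assuming Python `asyncio`, with a hypothetical helper `call_with_deadline`: putting a deadline on an operation turns "nominally working but too slow" into an explicit failure that callers already know how to handle.

```python
import asyncio


async def call_with_deadline(aw, deadline):
    """Hypothetical helper: return (True, result), or (False, None) on timeout."""
    try:
        # wait_for cancels the awaited operation if the deadline passes.
        return True, await asyncio.wait_for(aw, timeout=deadline)
    except asyncio.TimeoutError:
        # A slow call becomes a clear "no" instead of lingering half-alive.
        return False, None


async def main():
    fast = await call_with_deadline(asyncio.sleep(0.01, result="ok"), 1.0)
    slow = await call_with_deadline(asyncio.sleep(5.0), 0.05)
    print(fast, slow)  # prints: (True, 'ok') (False, None)
    return fast, slow


if __name__ == "__main__":
    asyncio.run(main())
```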