Did you ever think of testing these migrations against at least a subset of your production data? It sounds like you just wrote the migrations and let them go into the wild without thought.
According to the post, the pause/lockup could only happen under production queries, and was not directly related to production data.
They did test against the full production data, afterwards, and could not reproduce. Which was what they expected, since the migrations were on empty tables, which just happened to have foreign key constraints against large tables, but no rows for those constraints to actually apply to.
So for them to "fully" test this migration before applying it to production, they would need to be replaying all production queries against the testing database as well, and maybe even test the migration multiple times to get a statistical sense of the possible latencies.
(I've actually done something like that before, but it's not something you do for every little change)
Spot on. We ran the migrations against production-sized data beforehand, and everything was fine. We did it again after the downtime, and got the same results.
Most of our queries are fast (tens of milliseconds or lower), which is how we got away without knowing about this for so long. Unsurprisingly, we've been making a bigger effort to eliminate any slow queries we do find lately. ;)
The actual query you run in production, or similar enough? That could be a very expensive query with a very large amount of rows, even if indexed. (Large, being in the multi millions, depending on hardware). Might make more sense to run something like that on a reporting slave.
They did test against the full production data, afterwards, and could not reproduce. Which was what they expected, since the migrations were on empty tables, which just happened to have foreign key constraints against large tables, but no rows for those constraints to actually apply to.
So for them to "fully" test this migration before applying it to production, they would need to be replaying all production queries against the testing database as well, and maybe even test the migration multiple times to get a statistical sense of the possible latencies.
(I've actually done something like that before, but it's not something you do for every little change)