Hacker News new | ask | show | jobs
by ploxiln 4047 days ago
According to the post, the pause/lockup could only happen under production queries, and was not directly related to production data.

They did test against the full production data, afterwards, and could not reproduce. Which was what they expected, since the migrations were on empty tables, which just happened to have foreign key constraints against large tables, but no rows for those constraints to actually apply to.

So for them to "fully" test this migration before applying it to production, they would need to be replaying all production queries against the testing database as well, and maybe even test the migration multiple times to get a statistical sense of the possible latencies.

(I've actually done something like that before, but it's not something you do for every little change)

1 comments

Spot on. We ran the migrations against production-sized data beforehand, and everything was fine. We did it again after the downtime, and got the same results.

Most of our queries are fast (tens of milliseconds or lower), which is how we got away without knowing about this for so long. Unsurprisingly, we've been making a bigger effort to eliminate any slow queries we do find lately. ;)

Curious, is the:

  SELECT DISTINCT(email) FROM parent;
The actual query you run in production, or similar enough? That could be a very expensive query with a very large amount of rows, even if indexed. (Large, being in the multi millions, depending on hardware). Might make more sense to run something like that on a reporting slave.
That was a fabricated example, rather than something plucked from our codebase. It's the sort of query we avoid running on our production database.

I agree with your approach though. Our data analysis runs elsewhere.