by asuffield 3909 days ago
(Tedious disclaimer: my opinion, not speaking for my employer, etc)

I'm an SRE at Google, where postmortems are habitual. The thing that jumped out at me here is that a production change was instantaneously pushed globally, instead of being canaried on a fraction of the serving capacity so that problems could be detected. That seems like your big problem here.

(Of course, without knowing how your data storage works, it's difficult to tell how hard it is to fix that.)
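
For illustration, a minimal sketch of the canarying idea: apply the change to a small slice of serving capacity, check health, and only then continue. Everything here (`canary_rollout`, `apply_change`, `is_healthy`, the server names) is hypothetical and not taken from the postmortem; a real rollout would also need a soak time and a rollback of the canaries themselves.

```python
from typing import Callable, Sequence


def canary_rollout(
    servers: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_fraction: float = 0.05,
) -> bool:
    """Apply a change to a small slice first; continue only if that slice stays healthy.

    All names are hypothetical. Returns True if the change reached every server.
    """
    # Always canary on at least one server, even for tiny fleets.
    canary_count = max(1, int(len(servers) * canary_fraction))
    canaries, rest = servers[:canary_count], servers[canary_count:]

    for server in canaries:
        apply_change(server)

    # If any canary looks bad, stop here: the rest never received the change.
    if not all(is_healthy(server) for server in canaries):
        return False

    for server in rest:
        apply_change(server)
    return True
```

With 8 servers and `canary_fraction=0.25`, only 2 servers receive a bad change before the rollout stops.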

1 comment

Yup.

This is one of our few remaining unsharded databases (legacy problems...), so we can't easily canary a fraction of serving capacity. However, one clear remediation we can implement easily is to have our tooling change a replica first, fail over to it as primary, and, if problems are detected, quickly fail back to the healthy former primary.
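
A rough sketch of that replica-first flow, under stated assumptions: `Cluster`, `change_via_replica`, `apply_change`, `promote`, and `is_healthy` are hypothetical names for illustration, not our actual tooling, and the sketch ignores replication lag and writes accepted between the failover and the failback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cluster:
    """Hypothetical two-node database cluster."""
    primary: str
    replica: str


def change_via_replica(
    cluster: Cluster,
    apply_change: Callable[[str], None],
    promote: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> bool:
    """Change the replica, fail over to it, and fail back if it misbehaves.

    Returns True if the changed node ends up as primary.
    """
    old_primary, candidate = cluster.primary, cluster.replica

    # The change only ever touches the replica; the old primary stays pristine.
    apply_change(candidate)

    # Fail over: the changed replica becomes primary.
    promote(candidate)
    cluster.primary, cluster.replica = candidate, old_primary

    if is_healthy(candidate):
        return True

    # Problems detected: fail back to the healthy, unchanged former primary.
    promote(old_primary)
    cluster.primary, cluster.replica = old_primary, candidate
    return False
```

The point of the design is that the former primary never receives the change until the new primary has proven itself, so failing back is a promotion rather than a restore.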

Lesson learned. We'll be doing a review of all of our database tooling to make sure changes are always canaried or easily reversible.