by ha292
4195 days ago
This is a good effort, but I do have some concerns about it. A true root cause analysis would go deeper and ask why a single engineer could decide alone to roll out to all slices. The surface-level answer is that the Azure platform lacked tooling. Is that the cause or an effect? I think it is an effect. There are deeper root causes. Let's ask: why was it that the design allowed one engineer to effectively bring down Azure? We often stop these RCAs when they get uncomfortable and start to point upwards. I say this to the engineer who pressed the buttons: Bravo! You did something that exposed a massive hole in Azure, which may very well have prevented a much bigger embarrassment.
|
Because writing code that contains a large number of checks and balances is generally orders of magnitude more expensive than relying on human trust and judgment on the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just a natural human error. The engineer who did the rollout could have been any of us. Given what they knew, they thought they had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar to calls they make a number of times every day. This one didn't pan out.
I would hazard that most companies have a big red rollout button, reserved for trusted engineers, that will do a rollout without all the checks you're requesting.