by Lewisham
4195 days ago
A true root cause would go deeper and ask why is it that an engineer could single-handedly decide to roll out to all slices? Because writing code that contains a large number of checks and balances is generally orders of magnitude more expensive than relying on human trust and judgment on the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just natural human error. The engineer who did the rollout could have been any of us. Given what they knew, they thought they had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar to calls they make a number of times every day. This one didn't pan out. I would hazard that most companies have a big red rollout button, reserved for trusted engineers, that will do a rollout without all the checks you're requesting.
|
For critical infrastructure companies, the usual rule for rollouts is "four eyes".
So while most companies may well have a trusted person with the keys to the rollout car, the more critical the mission gets, the more human checks are put in place.
Maybe that's what the RCA should have said: we F-ed up designing and managing the rollout process. An engineer just fell victim to it.