by Lewisham 4195 days ago
A true root cause would go deeper and ask why is it that an engineer could solely decide to roll out to all slices?

Because writing code that contains a large number of checks and balances is generally orders of magnitude more expensive than relying on the trust and judgment of the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just natural human error. The engineer who did the rollout could have been any of us. Given what he/she knew, he/she thought he/she had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar to the calls he/she makes a number of times every day. This one didn't pan out.

I would hazard that most companies have a big red rollout button, reserved for trusted engineers, that will do a rollout without all the checks you're requesting.

2 comments

No one is saying that it had to be code. It could be as simple as "talk to a peer or your manager before taking the next step".

For critical infrastructure companies there is the usual rule of "four eyes" for rollouts.

So, while it may be the case that most companies will have the trusted person with the keys to the rollout car, the more critical the mission gets, the more human checks get put in.

Maybe that's what the RCA should have said -- we F-ed up designing and managing the rollout process. An engineer just fell victim to it.

Just a second level of approval can be very useful, without requiring orders-of-magnitude costs. That's partly because it usually requires that the change be explained in writing to the second approver, and that can often reveal issues.

It's not clear that he/she didn't notify a second person, who would likely have had the same knowledge he/she did. Given the same knowledge, the same push might well have happened.