HN Mirror


by ha292 4243 days ago

This is a good effort. I do have some concerns about it.

A true root cause would go deeper and ask why is it that an engineer could solely decide to roll out to all slices?

The surface-level answer is that the Azure platform lacked tooling. Is that the cause or an effect? I think it is an effect. There are deeper root causes.

Let's ask -- why was it that the design allowed one engineer to effectively bring down Azure?

We often stop these RCAs when they get uncomfortable and start to point upwards.

I say this to the engineer who pressed the buttons: Bravo! You did something that exposed a massive hole in Azure, which may very well have prevented a much bigger embarrassment.


Lewisham 4243 days ago

> A true root cause would go deeper and ask why is it that an engineer could solely decide to roll out to all slices?

Because writing code that contains a large number of checks and balances is generally orders of magnitude more expensive than relying on human trust and judgment on the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just a natural human error. The engineer who did the rollout could have been any of us. Given what he/she knew, he/she thought he/she had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar to calls he/she makes a number of times every day. This one didn't pan out.

I would hazard a guess that most companies have a big red rollout button, reserved for trusted engineers, that will do a rollout without all the checks you're requesting.


ha292 4243 days ago

No one is saying that it had to be code. It could be as simple as "talk to another peer or your manager before taking the next step".

For critical infrastructure companies there is the usual rule of "four eyes" for roll outs.

So, while it may be the case that most companies have a trusted person with the keys to the rollout car, the more critical the mission gets, the more levels of human checks are put in.

Maybe that's what the RCA should have said -- we F-ed up designing and managing the rollout process. An engineer just fell victim to it.


smackfu 4243 days ago

Just a second level of approval can be very useful, without costing orders of magnitude more. In part that is because it usually requires the change to be explained in writing to the second approver, and that can often reveal issues.
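
To make the second-approval idea concrete, here is a minimal sketch in Python of what such a gate might look like. It is purely illustrative: the names `RolloutRequest`, `approve`, `can_proceed`, and `rollout` are hypothetical, and nothing here is taken from the Azure postmortem or any real deployment tool. It only assumes that a rollout needs a written justification plus sign-off from someone other than the author.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RolloutRequest:
    """A proposed rollout. Hypothetical type, for illustration only."""
    author: str
    target_slices: List[str]
    justification: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author cannot act as their own second pair of eyes.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own rollout")
        self.approvals.add(reviewer)

    def can_proceed(self) -> bool:
        # Require a non-empty written justification and at least one
        # independent approver before anything is deployed.
        return bool(self.justification.strip()) and len(self.approvals) >= 1


def rollout(request: RolloutRequest) -> List[str]:
    """Return the slices that would be deployed, or refuse if unapproved."""
    if not request.can_proceed():
        raise PermissionError("rollout needs a justification and a second approver")
    return list(request.target_slices)
```

The check itself is a handful of lines. As the reply below points out, it only helps if the second person knows something the author does not, or if writing the justification surfaces the problem.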


Lewisham 4243 days ago

It's not clear he/she didn't notify a second person, who would likely have had the same knowledge he/she did. Given the same knowledge, the same push might well have happened.
