|
|
|
|
|
by grogenaut
600 days ago
|
|
From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up. Usually that person is the last in a long series of decisions that could have prevented the issue, and thus why blame them. That is unless the person is a) acting with malice, b) is repeatedly shown a pattern of willful ignorance. IIRC, when one person took down S3 with a manual command overriding the safeguards the action was not to fire them but to figure out why it was still a manual process without sign off. Say what you will about Amazon culture, the ability to make mistakes or call them out is pretty consistently protected. |
|
It didn't override safeguards, but they sure wanted you to think that something unusual was done as part of the incident. What they executed was a standard operational command. The problem was, the components that that command interacted with had been creaking at the edges for years by that point. It was literally a case of "when", and not "if". All that happened was the command tipped it over the edge in combination with everything else happening as part of normal operational state.
Engineering leadership had repeatedly raised the risk with further up the chain and no one was willing to put headcount to actually mitigating the problem. If blame was to be applied anywhere, it wasn't on the engineer following the run book that gave them a standard operational command to execute with standard values. They did exactly what they were supposed to.
Some credit where it's due, my understanding from folks I knew still in that space, is that S3 leadership started turning things around after that incident and started taking these risks and operational state seriously.