| A personal anecdote: One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client. There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes. My response: "Nothing. We're not going to do anything." The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?". I said something like "Look, people make mistakes. This is the first time that this kind of mistake has happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them." In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :) |
"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait for multiple incidents before identifying and addressing potential classes of problems.
While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.
[1] https://en.wikipedia.org/wiki/Swiss_cheese_model