by sb8244
660 days ago
Some of the worst mistakes I saw came from overreaction during an active incident. One of my programming mantras is "no black magic": if I don't understand why something works, then it's not done. I take the same approach to an incident. If someone can't coherently explain why their suggestion will have an impact, I don't think they should do it. Now, there may come a time when you need to just pull the trigger on something, but as I think back, I'm not sure that was ever the case in the end. It was wild to see the top brass, normally very cool and composed, start suggesting arbitrary potential fixes during an incident.
I'm willing to throw shit at the wall early in the triaging process, but only when it's low-impact, "simple" stuff. Things like:
Have we tried clearing the cache?
Have we checked the DNS resolver for errors?
Have we restarted the server?
And so on. I try to find the "dumb" problems before jumping to some wild fix. In one of the worst outages of my career, a team I was working for tried to do a full database restore, which had never been done in production, based on a guess. At 3am on a Saturday. I push back really hard on stuff like that.