by schoolornot
1199 days ago
Which is precisely why I don't understand the purpose of even having postmortems for 95% of outages. If everyone is aware of what went wrong and the issue is unlikely to ever happen again, what is the point? Well, at companies of the size I work with, it is to point fingers, make PMs feel more important, and give people talking points.

Because the only way you can make everyone aware is to write it down. Anything else is hearsay.
And going through the process of a thorough postmortem can ensure you do know exactly what went wrong and why, and how you can prevent the same and similar issues from happening again in the future.
Perhaps, in this example, it serves as documented proof that work on setting up staging databases needs to be prioritised and invested in? Maybe scripts such as this should be reviewed by another engineer before running? Maybe the standard operating procedure is updated so a backup is taken immediately before running any script that writes to the database? Maybe you create a rule to limit the blast radius in future and do smaller rollouts to 1k users instead of 100k? Maybe scripts should be developed with a dry-run feature?
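
To make the last two ideas concrete, here is a rough sketch of what a script with a dry-run default and a capped batch size could look like. Everything in it is made up for illustration: the `users` table, the `plan` column, the `run_script` name and the `max_users` cap are assumptions, not anything from the incident being discussed, and it uses Python's built-in `sqlite3` only so the example is self-contained.

```python
import sqlite3


def run_script(conn, user_ids, *, dry_run=True, max_users=1000):
    """Apply a write to a capped batch of users; roll back unless dry_run is False.

    Hypothetical example: the `users` table and `plan` column are assumptions.
    Returns the number of rows the UPDATE touched.
    """
    # Limit the blast radius: never touch more than max_users in one run.
    batch = list(user_ids)[:max_users]

    before = conn.total_changes
    try:
        conn.executemany(
            "UPDATE users SET plan = 'free' WHERE id = ?",
            [(user_id,) for user_id in batch],
        )
        # total_changes keeps counting even if the transaction is rolled back,
        # so the difference tells us how many rows the statement modified.
        affected = conn.total_changes - before
    except Exception:
        conn.rollback()
        raise

    if dry_run:
        # Dry run: the statement really executed, but nothing is persisted.
        conn.rollback()
    else:
        conn.commit()
    return affected
```

The point of defaulting `dry_run` to `True` is that the destructive path has to be asked for explicitly: you run it once, look at the affected count, and only then rerun with `dry_run=False`. It doesn't replace a backup or a second pair of eyes, it just makes the cheap mistake harder to make.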