
It seems like about half of the postmortems I've seen (public ones for high-profile things and private ones where I've worked) have the incident start either when someone pushed a change, or sometime after the change was pushed when the change blew up. This is why change moratoriums are so effective: when people stop messing with the system, it becomes stable.

Another large portion is power transfer switches failing. Then you have redundant Cisco products failing to fail over properly, often resulting in 30 seconds to 5 minutes of lost network connectivity and then (if you're reading a postmortem) cascading failures. After that it's one-off partial hardware failures where things worked well enough to pass healthchecks but not well enough to do actual work. My favorites are things like ECC correcting errors at such a high rate that the system spends 90%+ of its CPU servicing machine check exceptions, or a system that somehow booted with 64 MB of RAM instead of 4 GB and is, miraculously, running from swap.
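
To make the "passes healthchecks but can't do real work" case concrete, here is a rough sketch of a deeper check that would have flagged both of my favorites. Everything in it is my own assumption for illustration: the function names, the thresholds, and the idea that the caller already has a corrected-ECC-errors-per-minute number from whatever the platform exposes. The only real-world format it relies on is the `MemTotal:` line in Linux's `/proc/meminfo`.

```python
def mem_total_kib(meminfo_text: str) -> int:
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like: "MemTotal:       16384256 kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def deep_health_check(
    meminfo_text: str,
    expected_kib: int,
    corrected_errors_per_min: float,
    min_mem_fraction: float = 0.9,        # hypothetical threshold
    max_corrected_per_min: float = 100.0,  # hypothetical threshold
) -> list[str]:
    """Return a list of problems; an empty list means the host looks usable."""
    problems = []

    # A box that booted with a fraction of its RAM still answers pings.
    total = mem_total_kib(meminfo_text)
    if total < expected_kib * min_mem_fraction:
        problems.append(
            f"memory: {total} KiB visible, expected about {expected_kib} KiB"
        )

    # ECC quietly correcting errors is fine; a flood of corrections is not.
    if corrected_errors_per_min > max_corrected_per_min:
        problems.append(
            f"ecc: {corrected_errors_per_min:.0f} corrected errors/min"
        )

    return problems
```

On a real host you would feed it the contents of `/proc/meminfo` and the expected RAM size from your inventory; the 0.9 fraction is there because the kernel reports a bit less than the installed amount.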

You can obsess about bus factor, or you can hire people who are good at figuring out complex systems with no documentation and, if someone leaves, assign someone with good overall system knowledge to their system until you can find a new dedicated person.