And then that one person gets hit by a bus and you go out of business. Highly interconnected large-scale systems rarely have failure modes as simple as something a dev did or didn't do.
It seems like about half of the postmortems I've seen (public ones for high-profile incidents and private ones where I've worked) have the incident start either when someone pushed a change, or some time after the change was pushed, when it blew up. This is why change moratoriums are so effective: when people stop messing with the system, it becomes stable.
Another large portion is power transfer switches failing. Then you have redundant Cisco products failing to fail over properly, often resulting in 30 seconds to 5 minutes of lost network connectivity and then (if you're reading a postmortem) cascading failures. After that it's one-off partial hardware failures where things worked well enough to pass healthchecks but not well enough to do actual work. My favorites are things like ECC correcting errors at such a high rate that the system spends 90%+ of its CPU servicing machine check exceptions, or a system that somehow booted with 64 MB of RAM instead of 4 GB and is, miraculously, running from swap.
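To make the "passes healthchecks but can't do actual work" case concrete, here is a minimal sketch of a deeper health check. Everything in it is an assumption for illustration: `deep_health`, its parameters, and the 90% RAM tolerance are hypothetical names and values, not anything from a real monitoring product. The idea is just to compare the RAM the machine actually sees against what it should have, and to time a small unit of real work instead of only answering a ping.

```python
import time


def deep_health(total_ram_bytes, expected_ram_bytes, probe,
                max_probe_seconds, ram_tolerance=0.9):
    """Return a list of problems; an empty list means healthy.

    All names and thresholds here are hypothetical.
    probe is a callable that does a small piece of real work
    and returns True on success.
    """
    problems = []

    # Catches "booted with 64 MB instead of 4 GB".
    if total_ram_bytes < expected_ram_bytes * ram_tolerance:
        problems.append(
            f"RAM: saw {total_ram_bytes} bytes, expected about {expected_ram_bytes}"
        )

    # Catches "up, but too slow to do actual work" (e.g. CPU eaten elsewhere).
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception as exc:
        ok = False
        problems.append(f"probe raised {exc!r}")
    else:
        if not ok:
            problems.append("probe reported failure")
    elapsed = time.monotonic() - start

    if ok and elapsed > max_probe_seconds:
        problems.append(
            f"probe took {elapsed:.3f}s, limit is {max_probe_seconds}s"
        )
    return problems
```

On Linux the observed RAM could come from something like `os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")`; that is deliberately left to the caller so the check itself stays portable and easy to test.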
You can obsess about bus factor, or you can hire people who are good at figuring out complex systems with no documentation and, if someone leaves, assign someone with good overall system knowledge to their system until you can find a new dedicated person.
Arguing in favor of more than one person per project is not "obsessing" over bus factor lol. I want to be able to take days off, and I want my coworkers to enjoy the same.
The kind of takeaway I'd want to see from your first example is less like "don't do the things we know will cause breakage when we can't tolerate breakage" and more like "develop runtime-gating of new features and a way of sampling or shadowing production traffic onto n+1 builds before they are eligible to become the released build".
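As a rough sketch of what runtime gating plus shadowing could look like (an illustration under assumptions, not a description of any particular system): `in_rollout` and `handle` are hypothetical names, requests are assumed to be dicts with an `"id"` key, and `stable` / `candidate` stand in for the released build and the n+1 build. Users only ever get the stable result; the candidate runs on a deterministic sample of traffic and any differences or crashes are recorded.

```python
import hashlib


def in_rollout(flag: str, unit_id: str, percent: float) -> bool:
    """Deterministically decide whether unit_id falls inside percent (0-100).

    Hashing flag and id together gives a stable bucket in 0..9999, so the
    same request id always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def handle(request, stable, candidate, shadow_percent, mismatches):
    """Serve the stable build; shadow a sample of traffic onto the candidate.

    The candidate's output is never returned to the caller. Differences and
    exceptions are appended to mismatches for later review.
    """
    result = stable(request)
    if in_rollout("shadow-candidate", request["id"], shadow_percent):
        try:
            shadow = candidate(request)
            if shadow != result:
                mismatches.append((request["id"], result, shadow))
        except Exception as exc:
            # A crashing candidate must not affect the real response.
            mismatches.append((request["id"], result, repr(exc)))
    return result
```

Because `shadow_percent` is read at call time, it can be turned down to 0 without a deploy, which is the point of gating at runtime. A real setup would run the shadow call off the request path and only for side-effect-free requests; this sketch runs it inline to stay short.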
I've also had many issues with dodgy hardware of all types forever-circling repair queues in large fleets and never had a satisfying outcome for it either. Hopefully one of these days.