
You don't get down to one major failure a decade (or fewer) on systems this complex without understanding the statistical foundations of resiliency, n-modular redundancy, etc., and in fact being on the cutting edge of them (likely ahead of what you read in journal articles written by academics).

In real life, outside of a journal article, it's a lot harder than just deciding at 5000 feet whether you want to prioritize operating margin or resiliency.

In real life, when these sorts of edge cases happen, you have minutes or sometimes seconds to understand the tradeoffs, in terms of cost to your own company and your customers, of each of n specific possible failure modes, and to risk-manage so that you minimize the probability of the catastrophic outcomes. Sometimes this involves increasing the probability of low-cost bad outcomes. You can't reason about this stuff beforehand. If you could, you would have designed your system not to fail in that manner.
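
As a toy illustration of the tradeoff described above, where you accept more low-cost bad outcomes to cut the odds of a catastrophic one, here is a rough Python sketch. Everything in it is made up for illustration: the action names, the probabilities, the cost figures, and the `Action` / `pick_min_catastrophe` helpers are all hypothetical, and a real operator would not have clean numbers like these in the moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    p_catastrophic: float  # assumed probability of the catastrophic outcome
    p_minor: float         # assumed probability of a low-cost bad outcome


# Hypothetical costs in arbitrary units, chosen only for illustration.
COST_CATASTROPHIC = 1000.0
COST_MINOR = 10.0


def expected_cost(action: Action) -> float:
    """Probability-weighted cost of taking this action."""
    return (action.p_catastrophic * COST_CATASTROPHIC
            + action.p_minor * COST_MINOR)


def pick_min_catastrophe(actions: list[Action]) -> Action:
    """Pick the action with the lowest catastrophic probability,
    breaking ties by expected cost."""
    return min(actions, key=lambda a: (a.p_catastrophic, expected_cost(a)))


actions = [
    Action("keep serving at full load", p_catastrophic=0.05, p_minor=0.10),
    Action("shed load early", p_catastrophic=0.01, p_minor=0.60),
]

choice = pick_min_catastrophe(actions)
print(choice.name, expected_cost(choice))
```

With these invented numbers, the chosen action has a much higher chance of a minor bad outcome than the alternative, but a lower chance of the catastrophic one. The hard part in practice is the part the sketch skips: estimating those probabilities and costs in minutes or seconds.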