Hacker News new | ask | show | jobs
by kurthr 1220 days ago
There's been a lot of work on reliability of complex systems and how they operate. What has been found is that it is almost always necessary to have failure (degraded operation) modes that prevent system failure, and the more complex and more hazardous failure is the more modes develop.

In these systems it is found that they are almost always operating (or transitioning between) failure modes. Often multiple operational failure modes are simultaneous. It becomes very important to test the system in each of it's failure modes and their combinations to maintain high up time.

https://how.complexsystems.fail/ is an example, but there are many.

Human work, development, and maintenance is itself a system that interacts with these critical systems. Frankly, failure to fail causes failure (thus chaos monkey). The mythical man month is almost a sub category of these failures as are HR hiring processes and other BS. Being too successful and not having competition (or similarly sclerotic competition) can be as much of a hazard as "move fast, break things".

1 comments

"When a fail-safe system fails, it fails by failing to fail-safe."