Hacker News
by yourapostasy 1242 days ago
> If you have LOTO systems etc, what do you do when a sensor fails...

It pretty much boils down to what the business wants to prioritize: operating margin or resiliency. There is an entire subfield investigating the statistical foundations of resiliency, and in practice most commercial systems that choose to invest here implement the general case of N-modular redundancy as triple modular redundancy.
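To make the triple modular redundancy point concrete, here is a minimal sketch of a majority voter in Python. The function name `majority_vote` and the valve-state readings are hypothetical, made up for illustration. Real voters usually also deal with timing, analog tolerances, and failures of the voter itself, none of which is modeled here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(readings: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not readings:
        return None
    # most_common(1) yields the single most frequent reading and its count.
    value, count = Counter(readings).most_common(1)[0]
    # Require more than half of the replicas to agree; ties yield None.
    return value if count > len(readings) / 2 else None


# Three hypothetical replicas of one valve-state sensor; one has failed.
print(majority_vote(["closed", "closed", "open"]))  # closed
# No two replicas agree, so the voter cannot mask the fault.
print(majority_vote(["closed", "open", "fault"]))   # None
```

With three replicas the voter masks any single faulty reading. What to do when it returns `None` is exactly the kind of policy decision the quoted question is asking about.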

> Like it or not, these highly trained individuals are going to make mistakes every now and then.

Absolutely, and on well-led teams this is where the organization's no-blame learning culture swings into action.

> It's a lot harder than it looks.

We all know this, and we can all help each other get better at delivering ever-increasing value to our customers by sharing what worked in the contexts we deployed in!

1 comment

You don't get down to one major failure a decade (or fewer) on systems this complex without understanding the statistical foundations of resiliency, n-modular redundancy, and so on, and in fact being on the cutting edge of them (likely ahead of what you read in journal articles written by academics).

In real life, outside of a journal article, it's a lot harder than deciding from 5,000 feet whether you want to prioritize operating margin or resiliency.

In real life, when these sorts of edge cases happen, you have minutes or sometimes seconds to understand the tradeoffs, in terms of costs to your own company and your customers, of each of n specific possible failure modes, and to risk-manage so that you minimize the probability of the catastrophic outcomes. This may sometimes involve increasing the probability of low-cost bad outcomes. You can't reason about this stuff beforehand. If you could, you would have designed your system not to fail in that manner.