by quantgenius 1242 days ago
It's very easy to talk about completely automated systems and LOTO, and you need these when you have under-skilled staff. The NYSE likely does NOT have undertrained staff. If you have LOTO systems etc., what do you do when a sensor fails and you can't figure out why your check for whether the other system is running incorrectly reports that it is? Do you allow the stock market to simply not open?

What if multiple sensors fail, or it's an ambiguous situation? Say you are deciding whether or not to fail over a power circuit, and it's a brownout but not a complete power failure. What if there is a systemic problem and it's likely the backup power source is going to brown out too? At some point you need highly skilled individuals, like trained airline pilots flying a plane, who have the authority to override systems immediately without having to jump through hoops.

This is especially true for mission-critical systems. Many of the mission-critical systems we rely on are NOT built on the cloud (i.e. other people's computers), because you want to be really careful about what hardware you are using and precisely how your data center is set up, and you want to make sure things like a noisy neighbor do not impact you.

Like it or not, these highly trained individuals are going to make mistakes every now and then. A failure like this once every decade or so really isn't so bad. The individual who made this error is likely not a "grunt". I suspect the individual in question will not suffer any major consequences as a result of this, unless it wasn't a mistake but a flagrant disregard for the rules, like bringing a bottle of water into a data center and spilling it.

Have you built a mission-critical, distributed system that hasn't failed for 10 years? It's a lot harder than it looks. That's how often the NYSE has a problem like this: about once a decade. A lot of things that work in theory don't work for the edge cases, and the things that lead to problems once a decade or so are extreme edge cases.

In the grand scheme of things, a mucked-up opening auction is a minor problem. Anyone who sent a market-on-open order instead of taking the precaution of sending a limit order (despite it being standard practice to essentially always use limits) and got hurt badly will be made whole.

1 comment

> If you have LOTO systems etc, what do you do when a sensor fails...

It pretty much boils down to what the business wants to prioritize: operating margin or resiliency. There is an entire subfield investigating the statistical foundations of resiliency, and in practice the general case of N-modular redundancy is implemented as triple modular redundancy in most commercial systems that choose to spend on it.
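For anyone unfamiliar with the idea, here is a minimal sketch of an N-modular redundancy voter (triple modular redundancy when N is 3). It is an illustration only, not how any exchange does it; the function name `majority_vote` and the string readings are made up for the example. Note what happens when there is no strict majority: the voter returns `None`, which is exactly the ambiguous case where something else (or someone) has to decide.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(readings: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas.

    Returns None when no value has a strict majority (or when there are
    no readings), i.e. when the redundant replicas cannot settle it.
    """
    if not readings:
        return None
    # most_common(1) gives the single most frequent value and its count.
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


# Hypothetical readings from three redundant sensors.
agreed = majority_vote(["up", "up", "down"])          # one faulty sensor is outvoted
ambiguous = majority_vote(["up", "down", "unknown"])  # no majority at all
print(agreed, ambiguous)
```

With three replicas the voter masks any single faulty reading, but two simultaneous disagreements leave it with nothing to say.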

> Like it or not, these highly trained individuals are going to make mistakes every now and then.

Absolutely, and here is where the organization's no-blame learning culture swings into action for the well-led teams.

> It's a lot harder than it looks.

We all know this, and we can all help each other get better to deliver ever increasing value to our customers by sharing what works for the context we deployed within!

You don't get to major failures once a decade (or less) on systems this complex without understanding and in fact being on the cutting edge (likely ahead of what you read in journal articles written by academics) of the statistical foundations of resiliency, n-modular redundancy etc.

In real life, outside of a journal article, it's a lot harder than just deciding at 5000 feet whether you want to prioritize operating margin or resiliency.

In real life, when these sorts of edge cases happen, you have to understand in minutes or sometimes seconds the tradeoffs, in terms of costs to your own company and your customers, of each of n specific possible failure modes, and risk-manage so you minimize the probability of the catastrophic outcomes. This may sometimes involve increasing the probability of low-cost bad outcomes. You can't reason about this stuff beforehand. If you could, you would have designed your system to not fail in that manner.
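To make the shape of that tradeoff concrete, here is a toy sketch. Every name and probability in it is invented for illustration; the whole point above is that in a real incident you don't get clean numbers like these, so treat it as a picture of the decision rule, not a tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Action:
    name: str
    p_catastrophic: float  # assumed probability of a catastrophic outcome
    p_minor: float         # assumed probability of a low-cost bad outcome


def pick_action(actions: Sequence[Action]) -> Action:
    """Minimize catastrophic risk first; use minor risk only as a tie-breaker."""
    return min(actions, key=lambda a: (a.p_catastrophic, a.p_minor))


# Made-up options for the brownout example earlier in the thread.
options = [
    Action("keep_primary", p_catastrophic=0.05, p_minor=0.10),
    Action("fail_over", p_catastrophic=0.01, p_minor=0.40),
]
choice = pick_action(options)
print(choice.name)
```

Here the rule picks `fail_over` even though it quadruples the chance of a minor problem, because it cuts the chance of the catastrophic one.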