Hacker News
by lostlogin 4727 days ago
I too went down the Chernobyl history rabbit hole during so-called study. It's absolutely incredible. The big thing that gets me (which seems to happen at work in rather less cataclysmic style, and in every industry all the time) is the way that every misstep was followed by those involved making the worst possible decision. Then the next thing done compounds the earlier events - often with the chain of events occurring over months. Why do people universally do the worst possible thing at the worst possible time, and miss every opportunity for course correction?

Potentially relevant: "How Complex Systems Fail"

http://www.ctlab.org/documents/How%20Complex%20Systems%20Fai...

I think the people operating complex systems are regularly making mistakes and then correcting them. When a mistake is recovered from, the issue never becomes a problem, and there is no postmortem to come to our attention. It's only the cases where mistakes are made repeatedly over a long period, and the outcome is horrific, that the incident comes to our attention. It's a form of selection bias.

It also depends on the fundamental resilience of the system, whether or not single failures compound, whether or not the fail-safes themselves have failures or faults within them, and how personnel (and management) respond in the event of failure.

Known, simple, redundant, and stable systems which tend to return to modes of stability, which don't tend to experience runaway failure modes, and whose staffs are trained in known (and unknown) failure modes, tend to work well.

Unknown designs (they or staff are new, they're poorly documented, they're acquired from vendors or through organizational acquisition, etc.), whose staff aren't trained in normal and abnormal operations, which do tend to go into runaway failure modes, whose safety or management systems themselves have (known or unknown) bugs, etc., all tend to compound failure modes.

I've had direct experience of this at several levels myself. More frighteningly, I've interviewed senior management of a nuclear facility who candidly admitted that it was poorly managed.

Realize that a 4GW nuclear power plant is producing about $360,000 worth of retail electricity ($0.09/kWh) per hour, and that downtime costs over a million dollars every three hours. Keeping that plant online and operational has a very high priority -- sometimes to the point of cutting corners to do so if short-term objectives may be met at the cost of long-term sustainability.
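For anyone who wants to check the arithmetic, here is a quick sketch. The 4 GW output and the $0.09/kWh retail price come from the comment above; running at full output for the whole hour and selling every kWh at the retail price are simplifying assumptions of mine.

```python
# Back-of-the-envelope check of the figures quoted above.
PLANT_OUTPUT_GW = 4.0             # from the comment
RETAIL_PRICE_USD_PER_KWH = 0.09   # from the comment

KW_PER_GW = 1_000_000


def revenue_per_hour(output_gw, price_per_kwh):
    """Retail value of one hour of output, in USD.

    Assumes the plant runs at full output for the whole hour and that
    every kWh is sold at the retail price (both are simplifications).
    """
    kwh_per_hour = output_gw * KW_PER_GW  # 1 kW sustained for 1 hour = 1 kWh
    return kwh_per_hour * price_per_kwh


hourly = revenue_per_hour(PLANT_OUTPUT_GW, RETAIL_PRICE_USD_PER_KWH)
three_hours = 3 * hourly

print(f"per hour:        ${hourly:,.0f}")
print(f"per three hours: ${three_hours:,.0f}")
```

Under those assumptions the three-hour figure comes to a little over a million dollars, which matches the claim.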

In aviation, we call that the "Swiss cheese model". Every layer of defense has holes in it, and accidents happen when the holes in all the layers line up.

The corollary is that you must fix the holes as soon as you find them, so they won't all align in your path. And you do find the problems before a disaster, but people don't like to fix things.

Some people do, but most are worried they'll be blamed for either the fact that the problem exists in the first place (complainers, criticizers), for rocking the boat (whistleblowers, trouble-makers), or for fixing it wrong (stupid, careless). Our system, like many others for that matter, isn't set up to reward people who find and fix holes.

That all said, there are exceptions. But pettiness and self-centeredness can so often wreck it, or at least deter people from being bold enough to face judgment or possible error in order to do the right thing.
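To put rough numbers on the Swiss cheese picture above, here is a toy sketch. The per-layer failure chances are made up for illustration, and treating the layers as independent is a strong assumption: as others in this thread point out, real failures tend to compound, so actual layers are rarely independent.

```python
from math import prod


def accident_probability(hole_probabilities):
    """Chance that every layer fails at once.

    Assumes each layer fails independently of the others, which is an
    idealization; correlated failures make the real figure worse.
    """
    return prod(hole_probabilities)


# Hypothetical numbers: four layers of defense, each with a 10% chance
# of having a hole in the wrong place at the wrong time.
layers = [0.1, 0.1, 0.1, 0.1]
baseline = accident_probability(layers)

# "Fixing a hole": make one layer ten times less likely to fail.
fixed_layers = layers.copy()
fixed_layers[0] = 0.01
after_fix = accident_probability(fixed_layers)

print(f"all holes align (baseline):      {baseline:.6f}")
print(f"all holes align (one hole fixed): {after_fix:.6f}")
```

In this toy model, patching a single layer cuts the chance of everything lining up by the same factor as the patch, which is the case for fixing holes when you find them instead of waiting for the other layers to save you.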

If you liked "How Complex Systems Fail", you may be interested in "Normal Accidents" [1]. It's basically the long-form complete version of the paper you mentioned and packed with interesting examples.

[1] http://www.amazon.com/Normal-Accidents-Living-High-Risk-Tech...

Yes. Run, don't walk, to buy and read "Normal Accidents".

Anyone involved in systems design, or running systems, needs to read this book.

Part of it is selection bias -- if they hadn't done the stupidest possible thing all the time, then maybe there wouldn't have been a disaster for you to read about.

(I don't think this is a sufficient explanation, but I do think it's part of the explanation.)