by bumby 1562 days ago
>It is really weird to conclude that based on absence of data, it is probably safe.

Considering that's exactly what happened nearly 20 years earlier with Challenger, it seems to be more common and likely the result of a number of cognitive biases. We read these with some hindsight and are disconnected from all the other pressures (schedule, budget, peer, etc.) they are dealing with at the time.

2 comments

That points to a far more fundamental problem, one related to information processing higher up in the organisation. Just making better slides is unlikely to solve that problem.
Probably correct, and I have doubts that those types of problems are easily fixed because they're rooted in human psychology. It's interesting to me that the "big" incidents seem to occur every 15-20 years, almost as if there is a new professional cohort who has to learn the hard way. I do think clear communication is a necessary, but insufficient, element of fixing that problem.
One thing I wonder about with these kinds of accidents: to what extent does operational experience work its way back into the requirements for components?

For example, if pieces of foam are regularly hitting the tiles after launch, was handling that part of the specs for the tiles? Did anybody go back, take a worst-case scenario of a piece of foam hitting a tile (size, speed, etc.) and verify that the tiles could handle such an impact?

They'll generally use a Failure Mode and Effects Analysis (FMEA). So in this example, designers would identify all the ways a tile could fail and the consequence and probability of each failure. They then go through the process of mitigating it. The order of precedence for mitigations is 1) remove the hazard, 2) engineer around the hazard, 3) administrative controls (like standard procedures), 4) personal protective equipment. They iterate on this until the risk is within an acceptable range. All those mitigations become requirements.

So let's say they identify a tile failure mode as "tile struck by object". They assign a worst-case severity to that. Let's say they knew how bad it could be and they assign a severity of "loss of crew." Then they have to identify all the ways the tile could be struck and assign probabilities to that event happening. They use a matrix that maps the severity and probability to arrive at a risk classification. If the classification is higher than their threshold, they add mitigations that either reduce the severity or the probability (or both) until it's within an acceptable risk range.
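
To make the loop described above concrete, here is a rough sketch in Python. Everything specific in it is an assumption for illustration: the category names, the 1-4 scales, the score thresholds, and the candidate mitigations are hypothetical and not NASA's actual values, and a real risk matrix is a lookup table rather than a simple product of two ranks.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ordinal scales; real programs define their own categories.
SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "loss of crew": 4}
PROBABILITY = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}

# Order of precedence for mitigations, as listed above (lower is preferred).
PRECEDENCE = {"remove hazard": 1, "engineer around": 2, "administrative": 3, "ppe": 4}


def classify(severity: str, probability: str) -> str:
    """Map severity and probability to a risk class (assumed thresholds)."""
    score = SEVERITY[severity] * PROBABILITY[probability]
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


@dataclass
class Mitigation:
    name: str
    kind: str                          # one of the PRECEDENCE keys
    severity: Optional[str] = None     # severity after mitigation, if changed
    probability: Optional[str] = None  # probability after mitigation, if changed


def mitigate(severity, probability, candidates, acceptable=("low",)):
    """Apply mitigations in order of precedence until the risk is acceptable."""
    applied = []
    for m in sorted(candidates, key=lambda m: PRECEDENCE[m.kind]):
        if classify(severity, probability) in acceptable:
            break
        severity = m.severity or severity
        probability = m.probability or probability
        applied.append(m.name)  # each applied mitigation becomes a requirement
    return severity, probability, classify(severity, probability), applied


# Failure mode: "tile struck by object", with made-up candidate mitigations.
candidates = [
    Mitigation("pre-launch inspection procedure", "administrative", probability="remote"),
    Mitigation("harden tiles", "engineer around", severity="critical"),
    Mitigation("redesign foam to stop shedding", "remove hazard", probability="improbable"),
]
result = mitigate("loss of crew", "occasional", candidates)
```

The sketch only works if the inputs are right, which is exactly the weak point: if the probability assigned to "tile struck by object" is too low because earlier strikes did no harm, the loop stops early and reports an acceptable risk.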

There's lots that can go wrong with this process, though. You obviously have to be able to identify the failure modes. Is there some off-the-wall failure that nobody could foresee? Maybe. Then you have to have good enough data to objectively determine the risk. In this case, I wonder if all the previous foam strikes led them to discount the risk as too improbable/negligible to cause that failure mode. Add to that, the PowerPoint seemed to imply the model they used was too conservative (it was believed to overestimate the actual penetration). I know people involved in some hypervelocity testing of the foam and they were legitimately surprised at the way the foam acted when it was fired at higher speeds. So in this case, the risk was probably unknown beforehand, although they assumed they understood the risk sufficiently. To quote Mark Twain, "What gets us into trouble is not what we don't know. It's what we know for sure that just ain't so."

That's just one system on an immensely complex machine. It's easy to sit back with hindsight and say "Well, they shouldn't have made a decision until they did additional testing to get the data." But if they did that to every system on the Shuttle, it likely wouldn't have left the ground. In practice, engineers deal with all kinds of other cost and schedule constraints.

The issue is not that they had to ground the Shuttle until they had the data. The issue seems to be that foam was hitting the tiles with parameters outside their test database.

Why didn't they go back and test with 'real world' foam sizes?

I can only speculate.

I would push back on the idea that they would not have to ground the Shuttle. If they thought the foam could cause a loss of crew, they would ground the Shuttle until they fully understood the problem. That's exactly what happened in the aftermath of Columbia.

>Why didn't they go back and test with 'real world' foam sizes?

That's exactly what they did after the incident (while the Shuttles were grounded). If you're asking why didn't they do that beforehand, my assumption is they already had a model that they felt they could use. According to the subject PPT slides, they even thought that model was overly conservative. In addition, while foam-shedding was out of spec, it was considered "in family" meaning that they knew of the issue and felt like it was not a flight safety issue. Both their physical and mental models of the phenomena were, at best, incomplete but they didn't know that at the time.

This could also be related to a broader tendency to promote 'performers' who are more likely to take risks or shortcuts (which they might not realize involve risk), as well as people who use fewer resources (lower safety margins, fewer overlapping checks, etc.).

It's sadly difficult to be recognized for excellence in preventing surprises, as hard as it is to quantify that.

I think this is absolutely part of the issue. Having previously worked in the industry, people who bring up concerns are sometimes viewed as pariahs who are slowing down work. Because so many of the concerns involve low-probability events, it's possible for someone to make a career rolling the dice without being cognizant of (or open about) the risks. When bad things do happen (thankfully, major catastrophes are still relatively rare), it's hard for people to openly recognize the mitigations that could have prevented it because they think instituting them on future projects will just slow things down. It creates a culture of "the ends justify the means" where bad judgement and integrity violations are considered ok as long as the project/program was completed.
One reason is that bringing bad news may reflect poorly on the organization, and therefore on the person’s career.
It takes some intestinal fortitude to be in a role that is tasked with communicating information people don't want to hear. It's part of the reason NASA created its "Safety and Mission Assurance" organization after this incident and gave them a completely different chain of command. In theory, that mitigates some of the career threat, but in practice it may be different.