Hacker News
by bumby 1562 days ago
They'll generally use a Failure Mode and Effects Analysis (FMEA). So in this example, designers would identify all the ways a tile could fail, along with the consequence and probability of each failure. They then go through the process of mitigating it. The order of precedence for mitigations is 1) remove the hazard, 2) engineer around the hazard, 3) administrative controls (like standard procedures), 4) personal protective equipment. They iterate on this until the risk is within an acceptable range. All those mitigations become requirements.

So let's say they identify a tile failure mode as "tile struck by object". They assign a worst-case severity to that. Let's say they knew how bad it could be and assigned a severity of "loss of crew." Then they have to identify all the ways the tile could be struck and assign probabilities to that event happening. They use a matrix that maps the severity and probability to arrive at a risk classification. If the classification is higher than their threshold, they add mitigations that either reduce the severity or the probability (or both) until it's within an acceptable risk range.
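
To make the matrix step concrete, here is a minimal sketch in Python. The category names, the multiplicative scoring, and the thresholds are all assumptions made up for illustration; they are not NASA's actual matrix, and real programs define their own categories and cut-offs.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; CATASTROPHIC stands in for "loss of crew".
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


class Probability(IntEnum):
    # Hypothetical likelihood scale.
    IMPROBABLE = 1
    REMOTE = 2
    OCCASIONAL = 3
    PROBABLE = 4
    FREQUENT = 5


RISK_ORDER = ["low", "medium", "high"]


def classify(severity: Severity, probability: Probability) -> str:
    """Map a (severity, probability) pair to a risk class.

    The product-based score and the cut-offs (6 and 12) are illustrative only.
    """
    score = int(severity) * int(probability)
    if score >= 12:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


def is_acceptable(severity: Severity, probability: Probability,
                  threshold: str = "medium") -> bool:
    """True if the risk class is at or below the chosen threshold."""
    risk = classify(severity, probability)
    return RISK_ORDER.index(risk) <= RISK_ORDER.index(threshold)


# "Tile struck by object" before and after a mitigation that lowers probability.
before = classify(Severity.CATASTROPHIC, Probability.OCCASIONAL)
after = classify(Severity.CATASTROPHIC, Probability.IMPROBABLE)
```

Note that the mitigation here only moves the probability. The severity stays at the worst case, which is why a bad probability estimate can quietly put a catastrophic failure mode in the "acceptable" bucket.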

There's a lot that can go wrong with this process, though. You obviously have to be able to identify the failure modes. Is there some off-the-wall failure that nobody could foresee? Maybe. Then you have to have good enough data to objectively determine the risk. In this case, I wonder if all the previous foam strikes led them to discount the risk as too improbable to cause that failure mode. Add to that, the PowerPoint seemed to imply the model they used was too conservative (it was believed to overestimate the actual penetration). I know people involved in some hypervelocity testing of the foam and they were legitimately surprised at the way the foam acted when it was fired at higher speeds. So in this case, the risk was probably unknown beforehand, although they assumed they understood the risk sufficiently. To quote Mark Twain, "What gets us into trouble is not what we don't know. It's what we know for sure that just ain't so."

That's just one system on an immensely complex machine. It's easy to sit back with hindsight and say "Well, they shouldn't have made a decision until they did additional testing to get the data." But if they did that to every system on the Shuttle, it likely wouldn't have left the ground. In practice, engineers deal with all kinds of other cost and schedule constraints.


The issue is not that they had to ground the Shuttle until they had the data. The issue seems to be that foam was hitting the tiles with parameters outside their test database.

Why didn't they go back and test with 'real world' foam sizes?

I can only speculate.

I would push back on the idea that they would not have to ground the Shuttle. If they thought the foam could cause a loss of crew, they would ground the Shuttle until they fully understood the problem. That's exactly what happened in the aftermath of Columbia.

>Why didn't they go back and test with 'real world' foam sizes?

That's exactly what they did after the incident (while the Shuttles were grounded). If you're asking why didn't they do that beforehand, my assumption is they already had a model that they felt they could use. According to the subject PPT slides, they even thought that model was overly conservative. In addition, while foam-shedding was out of spec, it was considered "in family" meaning that they knew of the issue and felt like it was not a flight safety issue. Both their physical and mental models of the phenomena were, at best, incomplete but they didn't know that at the time.

So in your opinion, the slide said that with the impact of the foam it would have been very unlikely that the tile would have failed? In that case the interpretation by NASA of the slide was correct.

Which is weird because the slide also mentions that a small increase in energy can have a disproportionate effect.

I find it weird that they would rely on their model (for extrapolation) when they knew that the behavior of the tiles is non-linear. If they knew that the real world was outside their testing parameters and they decided not to test, then that sounds to me like a very serious omission.

I.e., it is weird to extrapolate tests to something 600 times bigger, especially when it concerns impacts on ceramics.
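
As a rough illustration of that point, a model can carry its validated range with it and report how far outside that range a query falls. This is only a sketch: the `ValidatedRange` class and the volume figures are hypothetical, with arbitrary units chosen so the ratio matches the "600 times bigger" figure above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedRange:
    """Range of projectile volumes covered by the test database (arbitrary units)."""
    min_volume: float
    max_volume: float

    def contains(self, volume: float) -> bool:
        """True if the query lies inside the tested range."""
        return self.min_volume <= volume <= self.max_volume

    def extrapolation_factor(self, volume: float) -> float:
        """How many times beyond the nearest tested bound the query lies.

        Returns 1.0 when the query is inside the tested range.
        """
        if volume > self.max_volume:
            return volume / self.max_volume
        if volume < self.min_volume:
            return self.min_volume / volume
        return 1.0


# Hypothetical numbers: the largest tested piece is 1.0 unit, the strike is 600 units.
tested = ValidatedRange(min_volume=0.5, max_volume=1.0)
strike_volume = 600.0
factor = tested.extrapolation_factor(strike_volume)
```

A guard like this does not say what the right answer is. It only makes the size of the extrapolation explicit, so it cannot be overlooked when the model output is read.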

Here's how I would interpret the slide, doing my best to prevent hindsight from biasing my opinion (since we already know what happened, it's tough to do).

Bullet 1: We looked at all the model data

Bullet 2: The model tends to predict deeper penetration than what we see in practice.

Bullet 3: The model penetration is related to particle velocity

Bullet 4: The penetration is related to the particle mass and surface area (they say volume)

Bullet 5: The foam is soft, so it takes a lot of energy to penetrate the hard ceramic tiles

Bullet 6: It is possible for the foam to penetrate the tiles, though, given enough energy

Bullet 7: If the foam does penetrate, it can cause a bad day

Bullet 8: It doesn't take much beyond the penetration energy to cause a bad day

Bullet 9: We haven't run tests that match the conditions of the strike so we don't have good data

Bullet 10: The foam piece is much, much larger than the stuff we tested

Given that, I would summarize it as: "The foam strike is much larger than what we've tested. All we know is this may mean there was significantly more energy involved, and if it's above the penetration threshold it can be bad. But the model seems to be overly conservative regarding penetration."

Now, the difficult decision is in the constraints. The astronauts didn't have the fuel to get to the ISS. They didn't have EVA suits to attempt a repair or evaluate the damage. There was no plan in place for a rescue mission. Atlantis was being prepped in FL, but was not yet ready. The astronauts probably had, at most, 30 days of oxygen.

Option 1: Allow for re-entry. Some had pegged this at about a 30% chance of success, but I have no idea what that is based upon.

Option 2: Scramble Atlantis to try a rescue mission. This is very risky for a number of reasons. Atlantis wasn't ready, meaning it would have to be rushed, increasing the chances that errors occur. Also, this type of mission had never been attempted before, where two Shuttles are within spitting distance and astronauts have to migrate from one to the other. For the timeline to be feasible, a decision would have had to be made within 1 day, 2 tops. This would then potentially risk two Shuttle crews instead of one, with an inherently risky mission.

I don't know what other options were available. Given when it occurred in the launch window and how long it took to understand there was a problem, an abort wasn't possible. The Shuttle did not have a launch abort engine like capsules do. Keep in mind, these decisions have to be made under large amounts of uncertainty. The extent of the damage was not fully known. People could scramble to run additional tests to gather data, but by that time the window on Option 2 may have closed. It's easy to armchair-quarterback this after the fact, but there weren't good, clear options at the moment.

This was the sixth time that exact piece had broken off and hit the vehicle. It was one roll of the dice too many. And by the way, Columbia had plenty of O2 but only 30 days' worth of CO2 scrubber capacity.

Thanks for the correction on the CO2.

>This was the sixth time that exact piece had broken off and hit the vehicle.

That was part of the problem and what is meant by the foam shedding being "in family." They had witnessed it enough (along with foam shedding elsewhere) without consequence that it wasn't really considered a credible risk. That held until it happened during a period of the launch where the delta-v made it a different scenario, with enough energy that the foam behaved in unexpected ways.

This is secondhand knowledge, but the person I know who was involved in the testing said they had a really hard time recreating the damage to tiles using the specs they were given after Columbia. By chance, they decided to turn the gun up beyond the quoted specs (as I understand it), and all of a sudden the foam acted like a hard chunk of debris. I think it's hard for people to grasp how much is unknown, even after a disaster. Often, it's only in hindsight that it seems obvious.