by fghgfdfg 3490 days ago
I think it's also important to note that the inertial platform was developed for the Ariane 4 where it worked correctly.

The software was actually developed correctly, and functioned as intended, at least for its intended use. Then it was tossed at a new use case without accounting for any of the differences in the new situation.


> The software was actually developed correctly

Not quite. If you read the details of the case, you can find that it didn't have a handler for the overflow in the calculations(!) It's similar to this case now, in that both were developed under the assumption "can't happen", that is, developed to be too brittle for inputs that were certain to happen as soon as the trajectory (in the case of Ariane 5) or the duration of the spinning movement (this case now) no longer matched their initial test cases.

Still, development, especially in this kind of project, is always a balancing act of covering most of the cases that can go wrong. Murphy's law works against the whole organization. Given the number of real problems, I'm still amazed that Apollo 11 succeeded.

Or even that there weren't any really destructive "accidents" involving rockets with nuclear warheads. Think about it: these are prone to the same problems as any other computer-related project, and the amount of damage is effectively infinitely larger than the effort needed to start it.

https://www.theguardian.com/world/2016/jan/07/nuclear-weapon...

“These weapons are literally waiting for a short stream of computer signals to fire. They don’t care where these signals come from.”

“Their rocket engines are going to ignite and their silo lids are going to blow off and they are going to lift off as soon as they have the equivalent of you or I putting in a couple of numbers and hitting enter three times.”

http://thebulletin.org/

"It is 3 minutes to midnight"

Also: "How Risky is Nuclear Optimism?"

http://www-ee.stanford.edu/%7Ehellman/publications/75.pdf

And if you still think "but it works, the proof is that it hasn't exploded up to now", just consider this graph from Nassim Taleb:

http://static3.businessinsider.com/image/5655f69c8430765e008...

> Not quite. If you read the details of the case, you can find that it didn't have a handler for the overflow in the calculations(!) It's similar to this case now, in that both were developed under the assumption "can't happen", that is, developed to be too brittle for inputs that were certain to happen as soon as the trajectory (in the case of Ariane 5)

I'm not sure that's entirely fair. The software was intended for the Ariane 4 which wasn't intended to have as much horizontal acceleration as the 5. If the 4 had experienced such an acceleration it wasn't intended to be capable of recovering from it. That area of the code also explicitly had some protections provided by the language removed for the sake of efficiency. So it wasn't a total oversight that just happened to work out - there was a decision made based on the fact the rocket had already irrecoverably failed if the situation ever occurred.

While I agree it's somewhat distasteful not to cover all the bases in the most technically correct way all the time, I'm not sure how important it is to have an overflow handler fire in the inertial reference system just as the rocket self-destructs.

> That area of the code also explicitly had some protections provided by the language removed for the sake of efficiency

As far as I know, efficiency wasn't the issue, just that the "model" was, as I've said, brittle. The overflow was to be handled with what we'd today call "an exception handler", and the selected solution was, instead of (reasonably) writing a "keep the maximum value as the result" handler, to leave the processor effectively executing random code if the overflow occurs. And the "exception" occurred. It's not that the overflow detection was turned off to save cycles, or that some default handling was provided. It was that it was handled with "whatever" (execute random instructions)! by intentionally omitting the handlers.
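For reference, a "keep the maximum value as the result" conversion could look roughly like the sketch below. It is Python rather than the original Ada, and the name `to_int16_saturating` is made up for illustration; nothing here is taken from the actual SRI code.

```python
import math

INT16_MIN = -2**15      # -32768
INT16_MAX = 2**15 - 1   #  32767


def to_int16_saturating(value: float) -> int:
    """Convert a float to a 16-bit signed integer, clamping out-of-range input.

    Hypothetical helper: values above INT16_MAX become INT16_MAX and values
    below INT16_MIN become INT16_MIN instead of raising an error.
    """
    if math.isnan(value):
        # NaN has no meaningful nearest integer, so this is still an error.
        raise ValueError("cannot convert NaN to an integer")
    clamped = max(INT16_MIN, min(INT16_MAX, value))
    return int(clamped)  # int() truncates toward zero


if __name__ == "__main__":
    for sample in (123.9, 40000.0, -1e9, float("inf")):
        print(sample, "->", to_int16_saturating(sample))
```

Whether a clamped value is the right answer for the consumer of that number is a separate question; the sketch only shows that the conversion itself can stay defined for every finite input.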

I don't really see that as the main point. Perhaps I shouldn't have mentioned it at all.

I don't see the practical issue with a model being brittle in the face of imminent mission failure. The model breaking down shortly before you self-destruct the whole thing seems like a rather minor concern. It's entirely irrelevant at that point what the model is.

It turns into an issue when somebody throws the software into a new environment without looking at it or its requirements and then doesn't do any testing with it. But that's not on the original developers. Their solution was entirely valid for their problem.

Even if they had done something like report the maximum value instead, the rest of the software for the Ariane 5 could well have been expecting it to do something else entirely which would still result in a serious problem.

It's an issue of inappropriately using software in a new situation. Without knowing and accounting for how it behaves, you can't just use it and expect everything to work perfectly the first time around. It doesn't matter how well the software accounts for various issues - at some point something won't have only a single correct answer and the software you are using will have to pick how to behave. If you aren't paying attention to that, it can/will come back to bite you.

> It doesn't matter how well the software accounts for various issues - at some point something won't have only a single correct answer

It does, immensely. That's why we have floating point processing units instead of the fixed point. Think about it: even single precision FP allows you to have "expected" responses between roughly 1E-38 and 1E38. There are fewer stars in the observable universe. Double precision FP allows the ranges of inputs and outputs to be between roughly 1E-308 and 1E308: there are only about 1E80 atoms in the whole observable universe. Can the response which says how much the rocket is "aligned" be meaningful -- sure it can.
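To put numbers on those ranges, here is a small Python sketch that decodes the IEEE 754 single-precision limits from their bit patterns and reads the double-precision limits from `sys.float_info`. The constant names are mine, and it assumes Python's `float` is an IEEE 754 double, which holds on mainstream platforms.

```python
import struct
import sys

# Largest finite and smallest normal single-precision values,
# decoded from their IEEE 754 bit patterns.
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]         # about 3.4e38
FLOAT32_MIN_NORMAL = struct.unpack("<f", struct.pack("<I", 0x00800000))[0]  # about 1.2e-38

# Double-precision limits as reported by the interpreter.
FLOAT64_MAX = sys.float_info.max         # about 1.8e308
FLOAT64_MIN_NORMAL = sys.float_info.min  # about 2.2e-308

# For comparison: the largest value a 16-bit signed integer can hold.
INT16_MAX = 2**15 - 1

if __name__ == "__main__":
    print("float32:", FLOAT32_MIN_NORMAL, "to", FLOAT32_MAX)
    print("float64:", FLOAT64_MIN_NORMAL, "to", FLOAT64_MAX)
    print("int16 max:", INT16_MAX)
```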

This piece of program catastrophically failed because some input was just somewhat bigger than before.

Properly programmed components that are supposed to handle "continuous" inputs and provide "continuous" outputs (and that is the specific part we talk about) should not have "discontinuities" at the arbitrary points which are the accidents of some unimportant implementation decisions (leaving "operand error" exception for some input variables but protecting from it for others!).

I can understand that you don't understand this if you've never worked in numerical computing or signal processing or something equivalently tied to "real life" responses, but I hope there are still enough professionals who know what I'm talking about.

Again from the report:

"The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected.

The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose."
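The failure mode the report describes can be imitated in a few lines. This is a Python sketch, not the Ada code: `OperandError`, `to_int16_checked` and the sample values are assumptions for illustration only.

```python
import math

INT16_MIN = -2**15      # -32768
INT16_MAX = 2**15 - 1   #  32767


class OperandError(ArithmeticError):
    """Stand-in for the operand error named in the report (hypothetical class)."""


def to_int16_checked(value: float) -> int:
    """Convert a 64-bit float to a 16-bit signed integer or raise OperandError."""
    if not math.isfinite(value):
        raise OperandError(f"{value!r} is not a finite number")
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in a 16-bit signed integer")
    return result


if __name__ == "__main__":
    # "Protected" use: the caller decides what happens on overflow.
    try:
        to_int16_checked(40000.0)
    except OperandError as exc:
        print("handled:", exc)

    # "Unprotected" use: the exception propagates to whoever is above.
    # to_int16_checked(40000.0)
```

The difference between the protected and unprotected conversions in the report is the equivalent of the `try`/`except` around the call: the conversion is the same, only the caller's handling differs.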

> That's why we have floating point processing units instead of the fixed point.

I'm not sure what that is supposed to mean. I was talking generally. Not every situation has a single appropriate value to represent it. I don't particularly care if this one example could have used a floating point or not.

> This piece of program catastrophically failed because some input was just somewhat bigger than before.

As far as the software was concerned the rocket had already catastrophically failed. It actually hadn't, because it was a different rocket than the software was designed for. It was "somewhat bigger" in the sense that it was large enough that the rocket the software was designed for would have been in an irrecoverable situation.

> Properly programmed components that are supposed to handle "continuous" inputs and provide "continuous" outputs (and that is the specific part we talk about) should not have "discontinuities" at the arbitrary points which are the accidents of some unimportant implementation decisions (leaving "operand error" exception for some input variables but protecting from it for others!).

That's theoretically impossible. If you want to account for every possible value you're going to need an infinite amount of memory. There will be a cutoff somewhere, no matter what. Even if that cutoff is the maximum value of a double precision float - that's an arbitrary implementation limitation. You can't just say you can more than count the stars in the sky and that's clearly and obviously good enough for everything. It's not.

There will be a limit, somewhere. It will be an implementation-defined one. As long as the limit suits the requirements, it effectively doesn't matter. In this case, the limit was set such that if it was reached the mission had already catastrophically failed. That's all that can practically be asked for.

I've checked the report: the exception resulted in the transmission of effectively random data to the main computer:

http://www.math.umn.edu/~arnold/disasters/ariane5rep.html

"g) As a result of its failure, the active inertial reference system transmitted essentially diagnostic information to the launcher's main computer, where it was interpreted as flight data and used for flight control calculations."

So the handler in the processes existed, but it effectively confused the main computer. The units shut off, but before that they sent "the diagnostic", for which there was no handler at all in the main computer. Even more interesting, these processes weren't even needed for the flight. The main computer could have just ignored such input and the flight would have continued (R1).
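One way to picture "diagnostic interpreted as flight data" is a receiver that never looks at what kind of message it got. The sketch below is a hypothetical Python model, not the real bus protocol: `Kind`, `SriMessage` and `FlightComputer` are invented names, and holding the last good attitude is just one possible policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Kind(Enum):
    FLIGHT_DATA = "flight_data"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class SriMessage:
    """Hypothetical message from an inertial reference unit."""
    kind: Kind
    payload: Tuple[float, ...]


class FlightComputer:
    """Hypothetical receiver that only acts on messages tagged as flight data."""

    def __init__(self) -> None:
        self.last_attitude: Optional[Tuple[float, ...]] = None
        self.ignored = 0

    def receive(self, msg: SriMessage) -> Optional[Tuple[float, ...]]:
        if msg.kind is Kind.FLIGHT_DATA:
            self.last_attitude = msg.payload
        else:
            # Diagnostic payloads are counted but never used for control.
            self.ignored += 1
        return self.last_attitude


if __name__ == "__main__":
    fc = FlightComputer()
    fc.receive(SriMessage(Kind.FLIGHT_DATA, (0.1, 0.2, 0.3)))
    print(fc.receive(SriMessage(Kind.DIAGNOSTIC, (9999.0, -9999.0, 0.0))))
```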

Brittle.

> It was that it was handled with "whatever" (execute random instructions)! by intentionally omitting the handlers.

Which is a perfectly valid course of action.

In fact, it is usually the only correct course of action, because there is no other correct course of action to take.

A "keep the maximum value as the result" handler is always plain wrong (and that extends to all cases of <return whatever fixed value sounds cool>); it wouldn't pass a code review.

Source: That's covered in the "safety & testing" courses of my previous university, that happen to be given by one guy who worked on the Arianes. :p

:) I could have expected that those involved would say "it was according to the specs." I don't claim it wasn't. But the commission didn't find that "it had to be all done as it was":

http://www.math.umn.edu/~arnold/disasters/ariane5rep.html

"4. RECOMMENDATIONS"

"R3 Do not allow any sensor, such as the inertial reference system, to stop sending best effort data."

See my other post, they effectively have sent something random ("diagnostics" instead of the data). And this piece of software wasn't even needed to run:

"R1 Switch off the alignment function of the inertial reference system immediately after lift-off. More generally, no software function should run during flight unless it is needed."

And of course, everything wasn't even tested together:

"R2 Prepare a test facility including as much real equipment as technically feasible, inject realistic input data, and perform complete, closed-loop, system testing. Complete simulations must take place before any mission. A high test coverage has to be obtained."
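In miniature, the idea behind R2 is to feed the new vehicle's expected input profile through the reused code before flight. The Python sketch below does that for a 16-bit conversion; `first_overflow` is a hypothetical helper and both profiles are made-up numbers, not real Ariane data.

```python
from typing import Optional, Sequence

INT16_MIN = -2**15      # -32768
INT16_MAX = 2**15 - 1   #  32767


def first_overflow(profile: Sequence[float]) -> Optional[int]:
    """Return the index of the first sample that would not fit in a
    16-bit signed integer after truncation, or None if every sample fits."""
    for index, sample in enumerate(profile):
        if not INT16_MIN <= int(sample) <= INT16_MAX:
            return index
    return None


# Invented profiles for illustration only.
OLD_VEHICLE_PROFILE = [1000.0, 8000.0, 20000.0, 12000.0]
NEW_VEHICLE_PROFILE = [1000.0, 15000.0, 45000.0, 30000.0]

if __name__ == "__main__":
    print("old vehicle:", first_overflow(OLD_VEHICLE_PROFILE))
    print("new vehicle:", first_overflow(NEW_VEHICLE_PROFILE))
```

A real closed-loop test would run the actual flight software against simulated sensors; this only shows that replaying the new profile is enough to surface a range assumption inherited from the old one.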

The piece of software was fine. It was done for Ariane 4 and worked as expected.

They re-used it for Ariane 5 without checking/adapting it to work in the different environment (more acceleration & thrust). I don't even know what the name is for that kind of mistake. ^^

> See my other post, they effectively have sent something random ("diagnostics" instead of the data).

The software failed. It doesn't matter what it returned at this point. There is nothing to do but to fix the bug in the software.

If it returned "last number" instead of what it did, it would be considered a bug in the exact same way.

For R2, I suppose that they reused the tests from Ariane 4 as well :D

What do we do about this?
Act! Share the info, raise awareness. It seems non-technical people can't imagine how easily computers and technology can go catastrophically wrong. The accident will happen and we must rationally minimize the impact:

http://nuclearrisk.org

Political action is essential.