by Athas 2350 days ago
I still think it's interesting to reflect on the mindset that leads to this conclusion. As far as I have been able to determine (although I'm not in the aerospace field), the 737 MAX is procedurally identical to the NG, except when something breaks. The failure modes are slightly different, with potentially lethal results. As a computer scientist, I'm not accustomed to thinking about functional equivalence in the presence of hardware failure, and maybe this Boeing employee was not sufficiently drilled on the need to consider such aspects for aeroplanes. It is of course the fault of Boeing corporate culture and internal procedures that this can be overlooked.
4 comments

>As a computer scientist, I'm not accustomed to thinking about functional equivalence in the presence of hardware failure...

How are you not?

I mean, I get it: 90% of the time we screw up the programming somehow. But as a computer scientist, I never ignore the possibility of hardware failure. Memory goes bad. Devices fail. Networks die. Semiconductors transiently behave in strange ways if you don't take the right precautions...

It's the entire impetus behind GIGO. If you shove garbage (corrupt data from a malfunctioning input source) into a perfectly working software system, you still get garbage out.

It's why life and safety critical automation is so fundamentally different from lower stakes programming tasks where "reboot the damn thing" is a viable option.

If your sensor goes bad, and you're in the air, you can't do squat to fix it. You have to detect the error, and fail the system gracefully by taking it out of the loop, informing the operator of the system failure, and most importantly, never allow that system to do anything that could jeopardize the ability of the operator to continue operating.
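To make the "detect, take it out of the loop, tell the operator" idea concrete, here is a minimal sketch in Python. It assumes two redundant sensors measuring the same quantity; the names `evaluate` and `Decision` and both thresholds are hypothetical, picked only for illustration, and have nothing to do with any real avionics code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical limits, chosen only for illustration.
VALID_RANGE_DEG = (-30.0, 30.0)
MAX_DISAGREEMENT_DEG = 5.0


@dataclass
class Decision:
    automation_enabled: bool       # may the automated system act at all?
    value_deg: Optional[float]     # trusted value, or None if there is none
    operator_alert: Optional[str]  # message shown to the operator, if any


def evaluate(left_deg: float, right_deg: float) -> Decision:
    """Cross-check two redundant readings before letting automation use them."""
    lo, hi = VALID_RANGE_DEG
    for name, reading in (("left", left_deg), ("right", right_deg)):
        # NaN fails this comparison as well, so it is rejected here too.
        if not (lo <= reading <= hi):
            return Decision(False, None, f"{name} sensor out of range")
    if abs(left_deg - right_deg) > MAX_DISAGREEMENT_DEG:
        # We cannot tell which sensor is wrong, so trust neither.
        return Decision(False, None, "sensors disagree")
    return Decision(True, (left_deg + right_deg) / 2.0, None)
```

The point of the design is that every failure path ends with the automation disabled and the operator informed, rather than with the system acting on a value it cannot vouch for.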

This is, or at least I thought it was, basic Control Systems 101...

> How are you not?

I research compilers and type systems. If the RAM dies while the compiler is running, you rerun the compiler on a new machine. A lot of computer science abstracts away the notion of hardware failure, because otherwise it becomes enormously cumbersome to talk about anything. This is fine as long as you don't actually build real high-reliability systems with the same approach.

>> How are you not?

> I research compilers and type systems.

I hope it's obvious that the software you work on is not supposed to be run during the flight.

The critical software is supposed to do as little as possible, and everything is expected to be in an already compiled (and thoroughly verified) state.

And even your own product, as soon as it is used not only for research but as a production compiler that produces firmware for the plane, would have to be proven far more rigorously than is expected of it while it is just a research artifact.

In short, even if you are lucky enough to just do research, you should be aware (and thankful) that critical software has other expectations. That includes how it responds to failed sensors: a different response to external inputs makes it fundamentally different software, even if you never thought about it before.

I think his main point was that for most of us, hardware failure is considered an adequate excuse for why something doesn't work -- most of us are not expected to have software that _continues working_ when things break.
The "failures" of the sensors are simply the "less common" inputs. Proper control software should simply be written for all possible inputs, including inputs from faulty sensors, and the result of the processing must not have catastrophic consequences.

Compare this to a web app that expects a username, but when the username is not the "most common" kind (e.g. it contains some new Unicode symbols, or is of zero length), it allows a catastrophic security failure and intrusion.
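As a rough illustration of treating the "less common" usernames as ordinary inputs, here is a sketch in Python. The function name `validate_username`, the length limit, and the allowed character set are all assumptions made up for the example, not a recommendation for any particular app.

```python
import unicodedata

MAX_LEN = 32  # hypothetical limit, for illustration only


def validate_username(raw: str) -> str:
    """Return a normalized username, or raise ValueError for anything else."""
    # Fold compatibility forms (e.g. ligatures) into one canonical spelling.
    name = unicodedata.normalize("NFKC", raw)
    if not (1 <= len(name) <= MAX_LEN):
        raise ValueError("username must be between 1 and 32 characters")
    for ch in name:
        # Letters and digits from any script are fine; control and
        # formatting characters (e.g. direction overrides) are not.
        if not (ch.isalnum() or ch in "_-"):
            raise ValueError(f"disallowed character U+{ord(ch):04X}")
    return name
```

The empty string and the odd Unicode cases are handled by the same code path as every other input, which is the whole idea: there is no "unexpected" branch left for an attacker to find.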

A great many procedures focus on failure scenarios. If these are different, then the MAX is procedurally different from the NG.
Yes - this seems to me to be the same erroneous thinking that led to the Ariane 5 maiden flight loss, caused by an integer overflow. There again the thinking was "this is effectively the same vehicle as its predecessor, therefore we do not need to test it thoroughly".
Similar thinking contributed to Therac-25 as well - the software worked safely on the old hardware, of course it will be safe on the updated hardware.
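For anyone who hasn't looked at the Ariane 5 case mentioned above: the overflow came from converting a 64-bit floating point value to a 16-bit signed integer without a range check. A hedged sketch of what a guarded conversion looks like, in Python for readability (the flight software was not Python, and the function names here are hypothetical):

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed range, refusing values that do not fit."""
    if not math.isfinite(value) or not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value!r} does not fit in int16")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """Alternative policy: clamp to the nearest representable value."""
    if math.isnan(value):
        raise ValueError("NaN has no meaningful int16 value")
    return int(max(min(value, INT16_MAX), INT16_MIN))
```

Which policy is right (reject, clamp, or something else) is a system-level decision. The sketch only shows that the out-of-range case has to be decided explicitly for the new vehicle's value ranges instead of inherited from the old one.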
I suppose it depends on the field in which you work, but many safety-critical fields have an expectation that hardware failures are captured and mitigated, and there are various tools to capture these design decisions and ensure they are tested. One example tool in this case would be a software fault tree analysis (FTA) or failure modes and effects analysis (FMEA) that looks at a broken sensor input value as a failure mode.[1]

It's been my experience, however, that these sorts of design tools are less familiar to software groups than to hardware bubbas. It's not uncommon to simply see "software fails" listed as a failure mode, which isn't very helpful. I'd be curious what the HN community's experience is with software as it relates to design tools like FTAs, FMEAs, hazard analyses, etc.
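To show the difference between "software fails" and a usable failure mode, here is a toy FMEA worksheet as a Python sketch. The rows and ratings are invented for illustration and are not taken from any real analysis; the risk priority number (severity x occurrence x detection) is one common FMEA ranking heuristic, not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    item: str
    mode: str
    effect: str
    mitigation: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (always detected) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means look at this first."""
        return self.severity * self.occurrence * self.detection


# Hypothetical rows for a single sensor input.
rows = [
    FailureMode("sensor input", "signal missing", "automation has no data",
                "disable automation, alert operator", 7, 2, 1),
    FailureMode("sensor input", "stuck at high value", "automation acts on bad data",
                "cross-check against redundant sensor", 10, 3, 4),
    FailureMode("sensor input", "noisy reading", "erratic commands",
                "filter and rate-limit", 5, 4, 3),
]

ranked = sorted(rows, key=lambda r: r.rpn, reverse=True)
for r in ranked:
    print(f"{r.rpn:4d}  {r.mode}: {r.mitigation}")
```

Each row names a specific way the input can be wrong and a specific response, which is something a test can be written against.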

[1] https://standards.nasa.gov/standard/nasa/nasa-gb-871913