| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jasode 2602 days ago

>as software engineers we find a bug and just fix it. [...] Unfortunately, the recent 737 MAX incidents seem to have changed this.

I think there's some nuance about MCAS that's lost in all the media reports. As far as I understand, the MCAS software didn't have a "bug" in the sense we programmers typically think of. (E.g. Mars Climate Orbiter's software programmed with incorrect units-of-measure.[0])

Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

In other words, the MCAS software actually did what Boeing managers specified it to do:

1) Did the software only read a _1_ AOA sensor with a single-point-of-failure instead of reading _2_ sensors? Yes, because that was what Boeing managers wanted the software to do. It was purposefully designed that way. If the software was changed to reconcile 2 sensors, it would then lead to a new "AOA DISAGREE" indicator[1] which would then raise doubts to the FAA that Boeing could just give pilots a simple iPad training orientation instead of expensive flight-sim training. Essentially, Boeing managers were trying to "hack" the FAA criteria for "single type rating".

2) Did software make adjustments of an aggressive and unsafe 2.5 degrees instead of a more gentle and recoverable 0.6 degrees? Yes, because Boeing designed it that way.

Somebody at Boeing specified the software design to be "1 sensor and 2.5 degrees" and apparently, that's what the programmers wrote.

I know we can play with semantics of "bug" vs "design" because they overlap but to me this seems to be a clear case of faulty "design". The distinction between design vs bug is important to let us fix the root cause.

The 737 MAX MCAS software issue isn't like the Mars Climate Orbiter or Therac-25 software bugs. The lessons from MCO and Therac-25 can't be applied to Boeing's MCAS because that unwanted behavior happens in a layer above the programming:

- MCO & Therac: design specifications are correct; software programming was incorrect

- Boeing 737MAX MCAS: design specifications incorrect; software programming was "correct" -- insofar as it matched the (flawed) design specifications

[0] https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...

[1] yellow "AOA Disagree" text at the bottom of display: https://www.ainonline.com/sites/default/files/styles/ain30_f...

5 comments

Animats 2601 days ago

That's a different issue. Aircraft systems are classified as to degree of risk. This is from MIL-STD-882C.

- I Catastrophic - Death, and/or system loss, and/or severe environmental damage.

- II Critical - Severe injury, severe occupational illness, major system and/or environmental damage.

- III Marginal - Minor injury, and/or minor system damage, and/or environmental damage.

- IV Negligible - Less then minor injury, or less then minor system or environmental damage.

Now, face it, most webcrap and phone apps are at level IV. Few people in computing outside aerospace regularly work on Level I systems. (Except the self-driving car people, who are working at Level I and need to act like it.)

MCAS started as just an automatic trim system. Those have been around for decades, and they're usually level III systems. They usually have limited control authority, and they usually act rather slowly, on purpose. So auto trim systems don't have the heavy redundancy required of level I and II systems. Then the trim system got additional functionality, control authority, and speed to provide the MCAS capability. Now it could cause real trouble.

At that point, the auto trim system had become a level I system. A level I system requires redundancy in sensors, actuators, electronics, power, and data paths. Plus much more failure analysis. A full fly-by-wire system or a full authority engine control system will have all that.

So either MCAS needed to have more limited authority over trim, so it couldn't cause trim runaway, or it needed the safety features of a Level I system. Boeing did neither. Parts of the company seem to have thought the system didn't have as much authority as it did. ("Authority", in this context, means "how much can you change the setting".)

Management failure.