Hacker News
by jasode 2602 days ago
>as software engineers we find a bug and just fix it. [...] Unfortunately, the recent 737 MAX incidents seem to have changed this.

I think there's some nuance about MCAS that's lost in all the media reports. As far as I understand, the MCAS software didn't have a "bug" in the sense we programmers typically think of. (E.g. Mars Climate Orbiter's software programmed with incorrect units-of-measure.[0])

Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

In other words, the MCAS software actually did what Boeing managers specified it to do:

1) Did the software read only _1_ AOA sensor, a single point of failure, instead of reading _2_ sensors? Yes, because that was what Boeing managers wanted the software to do. It was purposefully designed that way. If the software was changed to reconcile 2 sensors, it would then lead to a new "AOA DISAGREE" indicator[1] which would then raise doubts to the FAA that Boeing could just give pilots a simple iPad training orientation instead of expensive flight-sim training. Essentially, Boeing managers were trying to "hack" the FAA criteria for "single type rating".

2) Did the software make adjustments of an aggressive and unsafe 2.5 degrees instead of a more gentle and recoverable 0.6 degrees? Yes, because Boeing designed it that way.

Somebody at Boeing specified the software design to be "1 sensor and 2.5 degrees" and apparently, that's what the programmers wrote.

I know we can play with semantics of "bug" vs "design" because they overlap but to me this seems to be a clear case of faulty "design". The distinction between design vs bug is important to let us fix the root cause.

The 737 MAX MCAS software issue isn't like the Mars Climate Orbiter or Therac-25 software bugs. The lessons from MCO and Therac-25 can't be applied to Boeing's MCAS because that unwanted behavior happens in a layer above the programming:

- MCO & Therac: design specifications were correct; software programming was incorrect

- Boeing 737MAX MCAS: design specifications were incorrect; software programming was "correct" -- insofar as it matched the (flawed) design specifications
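To make the two cases above concrete, here is a toy sketch, not real flight or spacecraft code. All function names and the `aoa_limit_deg` parameter are made up for illustration; only the "1 sensor and 2.5 degrees" figures come from the comment itself. The first pair shows a coding bug against a correct spec (a unit mix-up in the spirit of MCO); the last function shows code that faithfully implements a flawed spec.

```python
# Toy illustration only; names and thresholds are hypothetical.

LBF_S_TO_N_S = 4.448222  # one pound-force second expressed in newton seconds


def impulse_newton_seconds_buggy(impulse_lbf_s: float) -> float:
    """Spec (correct): return impulse in newton seconds.
    Bug: the value is passed through without unit conversion."""
    return impulse_lbf_s


def impulse_newton_seconds_fixed(impulse_lbf_s: float) -> float:
    """Same spec, implemented correctly."""
    return impulse_lbf_s * LBF_S_TO_N_S


SPEC_TRIM_STEP_DEG = 2.5  # figure quoted in the comment above


def trim_command_per_spec(aoa_left_deg: float,
                          aoa_right_deg: float,
                          aoa_limit_deg: float) -> float:
    """Spec (flawed): consult ONE sensor, command a 2.5 degree step.
    The code matches the spec exactly; the second reading is ignored
    on purpose, because that is what the spec asks for."""
    if aoa_left_deg > aoa_limit_deg:
        return SPEC_TRIM_STEP_DEG  # nose-down trim, in degrees
    return 0.0


if __name__ == "__main__":
    # Coding bug: output disagrees with a correct spec.
    print(impulse_newton_seconds_buggy(10.0), impulse_newton_seconds_fixed(10.0))
    # Design flaw: one wildly wrong sensor is enough to trigger the command,
    # and no amount of code review against the spec would flag it.
    print(trim_command_per_spec(aoa_left_deg=40.0, aoa_right_deg=5.0, aoa_limit_deg=15.0))
```

A unit test written from the spec catches the first problem and passes the second, which is the whole distinction.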

[0] https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...

[1] yellow "AOA Disagree" text at the bottom of display: https://www.ainonline.com/sites/default/files/styles/ain30_f...

5 comments

That's a different issue. Aircraft systems are classified as to degree of risk. This is from MIL-STD-882C.

- I Catastrophic - Death, and/or system loss, and/or severe environmental damage.

- II Critical - Severe injury, severe occupational illness, major system and/or environmental damage.

- III Marginal - Minor injury, and/or minor system damage, and/or environmental damage.

- IV Negligible - Less than minor injury, or less than minor system or environmental damage.

Now, face it, most webcrap and phone apps are at Level IV. Few people in computing outside aerospace regularly work on Level I systems. (Except the self-driving car people, who are working at Level I and need to act like it.)

MCAS started as just an automatic trim system. Those have been around for decades, and they're usually Level III systems. They usually have limited control authority, and they usually act rather slowly, on purpose. So auto trim systems don't have the heavy redundancy required of Level I and II systems. Then the trim system got additional functionality, control authority, and speed to provide the MCAS capability. Now it could cause real trouble.

At that point, the auto trim system had become a Level I system. A Level I system requires redundancy in sensors, actuators, electronics, power, and data paths. Plus much more failure analysis. A full fly-by-wire system or a full authority engine control system will have all that.

So either MCAS needed to have more limited authority over trim, so it couldn't cause trim runaway, or it needed the safety features of a Level I system. Boeing did neither. Parts of the company seem to have thought the system didn't have as much authority as it did. ("Authority", in this context, means "how much can you change the setting".)

Management failure.
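A rough sketch of what "limited authority" means in code terms, assuming a made-up `LimitedAuthorityTrim` class and using the 0.6 and 2.5 degree figures from upthread purely as sample numbers. The idea is that the cap applies to the cumulative change, so repeated activations cannot add up to a runaway. This is an illustration of the concept, not how any real trim system is built.

```python
class LimitedAuthorityTrim:
    """Hypothetical trim channel whose TOTAL commanded change is capped."""

    def __init__(self, max_total_deg: float) -> None:
        self.max_total_deg = max_total_deg  # authority limit, in degrees
        self.applied_deg = 0.0              # cumulative trim applied so far

    def request(self, delta_deg: float) -> float:
        """Apply as much of delta_deg as fits inside the authority limit
        and return the amount actually granted."""
        target = self.applied_deg + delta_deg
        clamped = max(-self.max_total_deg, min(self.max_total_deg, target))
        granted = clamped - self.applied_deg
        self.applied_deg = clamped
        return granted


if __name__ == "__main__":
    trim = LimitedAuthorityTrim(max_total_deg=0.6)
    print(trim.request(2.5))  # only part of the request is granted
    print(trim.request(2.5))  # nothing left: the limit is cumulative
```

Capping each individual command without tracking the running total would not give this property, since repeated commands would still stack.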

> So either MCAS needed to have more limited authority over trim, so it couldn't cause trim runaway, or it needed the safety features of a Level I system.

There are two other dodgy things going on. One, you can't disable MCAS without totally disabling the electric trim. Two, the mechanical advantage of the manual trim isn't sufficient to re-adjust trim once it's too far out. And it hasn't been _forever_.

"If the software was changed to reconcile 2 sensors, it would then lead to a new "AOA Disagree" indicator which would raise doubts to the FAA that Boeing could just give pilots a simple iPad training orientation instead of expensive flight-sim training."

I always liked this quote from the "Mythical Man-Month": “Never go to sea with two chronometers, take one or three”.

https://blog.ipspace.net/2017/01/never-take-two-chronometers...

>I always liked this quote from the "Mythical Man-Month": “Never go to sea with two chronometers, take one or three”.

(I can't tell if you're making a side comment or specifically replying to the categorization of "software bug".)

Yes, the 737 MAX only has 2 AOA sensors instead of 3 like the Airbus A320. This is a physical design of sensors mounted on the airframe. But this aerospace engineering design detail seems outside the scope of assigning blame to software programmers writing code. (There isn't a software coding methodology that makes a 3rd AOA sensor appear.)

It was mostly a side comment, but it seemed appropriate in this context.

For such an important system, I would think that a single instance would be too few (a single point of failure, which wouldn't be allowed even in a less mission-critical system), while with two you would not be able to resolve a reading conflict between them (which is the point of the quote), so three is probably a reasonable number.

Even if applied to design and not to software programming, the concept is still sound. The point was what the quote meant, not its source.
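The one/two/three argument above can be sketched in a few lines. This is a simplified illustration with a hypothetical `resolve` function and made-up readings and tolerance, not how any real air data system votes: one sensor is trusted blindly, two can detect a conflict but not resolve it, and three let a median outvote a single bad reading.

```python
from statistics import median
from typing import Optional, Sequence


def resolve(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Return a usable value, or None when the readings conflict
    and there is no way to tell which one is wrong."""
    n = len(readings)
    if n == 1:
        # Single point of failure: nothing to cross-check against.
        return readings[0]
    if n == 2:
        a, b = readings
        if abs(a - b) <= tolerance:
            return (a + b) / 2
        # Conflict detected, but not resolvable: the safe move is to
        # report "no valid value" and let the caller stand down.
        return None
    if n == 3:
        # The median ignores one outlier, whichever sensor it is.
        return median(readings)
    raise ValueError("this sketch handles 1 to 3 readings only")


if __name__ == "__main__":
    print(resolve([74.5], tolerance=5.0))              # bad value accepted
    print(resolve([74.5, 15.0], tolerance=5.0))        # conflict, no value
    print(resolve([74.5, 15.0, 14.0], tolerance=5.0))  # outlier outvoted
```

Note that the two-sensor case is still useful: it cannot pick a winner, but it can refuse to act on a reading that is contradicted.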

Using two would have been a strictly better design than just one. Obviously three is better, but two disagreeing and shutting down MCAS would be much, much preferable to relying on one to the grave.

Preferable to the passengers, but not to Boeing, because that little warning light would warrant retraining and re-certification of pilots.

It's why they didn't do it.

>Instead, the MCAS system was poorly designed because of financial pressure to maintain the fiction of a single 737 type rating.

OK but how do we know, how is it demonstrated, that this financial pressure condition has now been mitigated? What is the exact nature of the "fix"? And actually what are all of the closed door conversations, back then and now, about the various possible behaviors for this software routine? How is it they came up with that one? How is it they come up with the new one? And really, why is the first one wrong (aside from the fact there are a bunch of dead people, which is a consequence of the original error)?

And which parts of the design? There are many parts to it. Not all of them are as bad as others.

As a pilot I find it impossible to imagine a closed door room with engineers not computing, let alone not imagining, the potential for this particular failure mode. And if a pilot were present in that closed door session, I find it impossible they would not immediately be bothered by the potential for mistrim at low altitude that would result in too scary a probability of unrecoverability.

It makes me wonder if pilots were even involved at that level of the design and decision making for the feature.

> It makes me wonder if pilots were even involved at that level of the design and decision making for the feature.

They were not.[1]

[1] https://www.wsj.com/articles/boeings-own-test-pilots-lacked-...

I'm aware of this reporting; that's about the test pilots. I'm talking about engineers who are pilots, and pilots consulted for the design portion, whether they be in-house, customers, or consultants brought on board.

If you're a food products company, you bring in people to do some kind of product testing to see if they like it or hate it. So far I have yet to read or hear from a pilot who says: Oh this is a GREAT idea! Love the entire concept!

Yes, everyone wants to blame the pilots or the engineers at Boeing.

This was not an engineering problem; it was an intentional, greed-driven management decision. Management designed for failure, because management changed the goal to put money over life in distinct ways.

There is no other useful meaning of "correct code" apart from "matches specification/design". There is no notion of correctness for design. The design may not be consistent with safety requirements for example.

Of course there is correctness for design!

When reviewing a design, the first thing to verify is whether it can satisfy its input requirements. In your example, a design that has to satisfy a safety requirement but doesn't is not correct and must be rejected.

My comment tried to use the same words as its parent:

> Somebody at Boeing specified the software design to be "1 sensor and 2.5 degrees"

What is called a "software design" there, you would probably call a "requirement".

I agree with you. Your meaning of "design" has a notion of correctness.

The safety requirements are one of the design decisions that have to be made, not a separate thing that exists outside design space.

No!

Requirements are product features that must be present. A Design is one of many potential ways to satisfy that set of Requirements.

For example, a requirement might be "the user shall not be exposed to hazardous voltages (defined elsewhere) when servicing the equipment."

A possible Design solution might be "provide cover interlock switches so when the covers are opened, all voltage supplies are disconnected." or "software monitors a cover switch, and when that particular cover is opened, a command is sent to the power controller to disconnect power to anything that is reachable from that opening."

Which of the two (or other) design options is chosen is a Design Decision, but they are means to an end, that end being Satisfying The Requirement.
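One way to picture the split, as a minimal sketch with invented names (`Equipment`, `requirement_met`, and the two `design_*` functions are all hypothetical): the Requirement is a single check that does not care how it is met, and each Design is a different mechanism that gets judged against that same check.

```python
from dataclasses import dataclass


@dataclass
class Equipment:
    cover_open: bool
    supplies_live: bool


def requirement_met(eq: Equipment) -> bool:
    """Requirement: the user is never exposed to hazardous voltage
    while servicing, i.e. never 'cover open AND supplies live'."""
    return not (eq.cover_open and eq.supplies_live)


def design_hardware_interlock(cover_open: bool) -> Equipment:
    """Design A: an interlock switch physically breaks the supply
    whenever the cover is open."""
    return Equipment(cover_open=cover_open, supplies_live=not cover_open)


def design_software_monitor(cover_open: bool,
                            controller_responds: bool = True) -> Equipment:
    """Design B: software watches a cover switch and commands the power
    controller to disconnect. It depends on the command being obeyed."""
    disconnected = cover_open and controller_responds
    return Equipment(cover_open=cover_open, supplies_live=not disconnected)


if __name__ == "__main__":
    for cover_open in (False, True):
        print(requirement_met(design_hardware_interlock(cover_open)),
              requirement_met(design_software_monitor(cover_open)))
    # Design review asks the same question under failure assumptions too.
    print(requirement_met(design_software_monitor(True, controller_responds=False)))
```

Both designs pass in the nominal case; only reviewing each one against the requirement under failure assumptions shows where they differ.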