Hacker News
by Grimm665 2368 days ago
I don't know, space is hard and things go wrong. Just because we've done it hundreds of times doesn't mean we should expect perfection.

Yes it does?

These are known engineering problems with known engineering solutions. The explanation from Boeing was that a timer was set incorrectly. This sounds like a trivial error to me (though I'm not a "rocket scientist", just a "Kerbal scientist", I guess), and afaik we've been using timers for a long time to properly manage burns to orbit.

Let's take a moment to consider the fact that apparently the MCAS uses input from only one of the two AoA sensors on a 737 MAX and swaps which one it takes the data from after each flight. I can't grasp how everyone involved could fail to realize that this statistically makes it less safe than only having one sensor.
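A back-of-the-envelope sketch of that statistical point, under assumptions that are mine rather than the commenter's: each AoA sensor is independently faulty with some hypothetical probability p, and a fault persists long enough that the flight-to-flight alternation eventually reads from the bad sensor. The cross-checked case is a hypothetical alternative included only for contrast.

```python
def p_single_sensor(p: float) -> float:
    """MCAS wired to one dedicated sensor: exposed only to that sensor's fault."""
    return p

def p_alternating(p: float) -> float:
    """MCAS alternating between two sensors across flights: it eventually reads
    whichever sensor is faulty, so it is exposed to a fault in either one."""
    return 1 - (1 - p) ** 2

def p_cross_checked_upper_bound(p: float) -> float:
    """Hypothetical alternative: compare both sensors and inhibit on disagreement.
    Acting on bad data then needs both sensors faulty, which is at most p**2."""
    return p ** 2

for p in (1e-3, 1e-4, 1e-5):  # hypothetical per-sensor fault probabilities
    print(
        f"p={p:g}: single={p_single_sensor(p):.3g}, "
        f"alternating={p_alternating(p):.3g}, "
        f"cross-checked<={p_cross_checked_upper_bound(p):.3g}"
    )
```

The alternating case works out to 2p - p^2, roughly double the single-sensor figure, which is the sense in which it is "statistically less safe".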

I don't know how far the systemic issues that clearly compromised the design of the plane extend to the design of the capsule, but trivial errors seem very possible.

> I can't grasp how everyone involved could fail to realize that this statistically makes it less safe than only having one sensor.

This design is bad, but it makes sense as an update of the 737. The flight computer setup is that each pilot gets a computer under their chair, each computer gets its own set of sensors, and the computers take turns each flight. The flight computer is generally safe (I don't think it's been implicated in any crashes?), but that's because if something goes wrong in flight, the system will usually disengage and alert the pilot; if it takes poor actions, it will disengage when the pilot opposes it, or the pilot will disengage it.

Adding MCAS to the flight computer makes sense; the flight computer needs to be aware of it. It's understandable, but negligent, to add a new feature to the computer without considering the original design. The problem is that MCAS was not disclosed to pilots, doesn't disengage on errors, doesn't disengage when pilot input opposes it (partially by design), and can't be disabled except by disabling electric trim, which is more or less needed to recover from the error condition MCAS puts the plane into.
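A toy model of the contrast drawn in the two paragraphs above, purely to make the disengage conditions explicit. The `Situation` fields and both functions are hypothetical names; this is not Boeing's logic, only the behavior as the commenter describes it.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    fault_detected: bool  # the computer has noticed something is wrong
    pilot_opposing: bool  # pilot control input is fighting the automation

def flight_computer_engaged(s: Situation) -> bool:
    """Original flight computer as described: it drops out on a detected
    issue or when pilot input opposes it."""
    return not (s.fault_detected or s.pilot_opposing)

def mcas_engaged(s: Situation, electric_trim_on: bool) -> bool:
    """MCAS as described: neither condition in `s` stops it. The only cutoff
    is disabling electric trim, which also removes the trim the crew would
    need to recover."""
    return electric_trim_on

bad_day = Situation(fault_detected=True, pilot_opposing=True)
print(flight_computer_engaged(bad_day))              # False
print(mcas_engaged(bad_day, electric_trim_on=True))  # True
```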

I think this is fixable, but the public information on the current fix doesn't include being able to turn MCAS off, so it doesn't seem like they've really done enough.

Of course they realize it makes it statistically less reliable. I think the gap is that it becomes much more difficult to assess the probability of failure across different systems. In the case of MCAS, they already had the ability to override it. In complex systems, one domain may think a simple mitigation is sufficient (e.g., the pilot can override MCAS) without understanding the layering of other issues (e.g., human factors like complex controls, lack of training, etc.). This means that, from the standpoint of a single domain, that simple mitigation may incorrectly be assumed to bring the risk probability into a reasonable range.
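A minimal numeric sketch of that layering argument. It assumes a bad outcome requires both the automation misbehaving and the mitigation failing, treats those as independent (itself a simplification), and uses made-up probabilities chosen only to show how one domain's estimate of the override can dominate the result.

```python
def p_bad_outcome(p_fault: float, p_override_fails: float) -> float:
    """A bad outcome needs the automation to misbehave AND the mitigation to fail.
    Assumes the two are independent, which is itself a simplification."""
    return p_fault * p_override_fails

p_fault = 1e-4  # hypothetical per-flight chance the automation acts on bad data

# One domain's view: "the pilot can override it", so override failure looks negligible.
assumed = p_bad_outcome(p_fault, p_override_fails=1e-3)

# Another domain's view: undisclosed behavior, confusing controls and little
# training make the override far less reliable in practice (hypothetical figure).
layered = p_bad_outcome(p_fault, p_override_fails=0.3)

print(f"assumed risk: {assumed:.1e} per flight")
print(f"layered risk: {layered:.1e} per flight ({layered / assumed:.0f}x higher)")
```

Neither domain is wrong about its own factor; the risk estimate is off because the factors were never multiplied together by anyone.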

I think it's important to acknowledge process failures, like lack of communication between domains, rather than settle for simple conclusions that only seem clear in hindsight.

The procedures, and how software systems handle changes to launch time, are among the hundreds of thousands of choices made during design and implementation that need to be validated. Yes, they feel like "silly mistakes", but ultimately most things that lead to failure will be in that category.
Or, in other words, getting to space is hard because it requires millions of opportunities for silly mistakes.

Most complex engineering projects are hard not because of one thing, but because of the mind-bogglingly large number of things that must all be done correctly.

(1 - x)^y is the chance that nothing goes wrong, where x is the chance of each small mistake and y is the total number of opportunities (assuming they're independent). If y is large enough, x doesn't need to be very large for things to start looking dicey.
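A quick numeric check of that formula, assuming the y opportunities are independent and share the same x (the value of x here is hypothetical). The exp(-x*y) column is the standard approximation for small x, and it shows why the product x*y, not x alone, decides how dicey things look.

```python
import math

def p_no_mistakes(x: float, y: int) -> float:
    """Chance that all y independent opportunities go right, each failing with probability x."""
    return (1 - x) ** y

x = 1e-4  # hypothetical one-in-ten-thousand chance per opportunity
for y in (100, 1_000, 10_000, 100_000):
    print(
        f"y={y:>7,}: P(no mistakes)={p_no_mistakes(x, y):.4f}, "
        f"exp(-x*y)={math.exp(-x * y):.4f}"
    )
```

At y = 10,000 the chance of a clean run is already down to about 37%, even though each individual step is 99.99% reliable.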

Yeah, this is a key insight, and something I didn't learn as a software developer with many years of experience until I studied probability formally. Maybe these days it's better known. It's closely related to the inclusion-exclusion principle, which can be used to model failure probability.
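A small sketch connecting the two ideas, assuming independent failures that each have the same probability x: inclusion-exclusion over the y events (add single failures, subtract pairs, add triples, and so on) gives exactly the complement of the (1 - x)^y formula above. The values of x and y are hypothetical.

```python
from math import comb

def p_any_failure_inclusion_exclusion(x: float, y: int) -> float:
    """P(at least one of y independent failures, each with probability x), via
    inclusion-exclusion: each k-way intersection has probability x**k and
    there are comb(y, k) of them, with alternating signs."""
    return sum((-1) ** (k + 1) * comb(y, k) * x ** k for k in range(1, y + 1))

def p_any_failure_complement(x: float, y: int) -> float:
    """Same quantity via the complement: 1 minus the chance nothing fails."""
    return 1 - (1 - x) ** y

x, y = 0.01, 20
print(p_any_failure_inclusion_exclusion(x, y), p_any_failure_complement(x, y))
```

Keeping only the first term gives y * x, which is always an upper bound on the true probability (the union bound) and is a handy quick estimate when x * y is small. For large y the alternating sum loses precision to cancellation, so the complement form is the one to use in practice.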
Thanks for the term! That's combinatorics, so I probably should have remembered it. :/

There are a lot of things I'd love to know the accepted names for, since I came to understand them through the back door. I regret that my college CS track didn't include more courses borrowed from physical engineering on reliability (and control theory). So many valuable, applicable lessons.

At some point we do, and that time is now. How qualified are you to give them a pass?