Hacker News
by whoisburbansky 1881 days ago
I don't mean this to disparage Airbus in any way but after Boeing's issues with the 737 MAX I'd assumed a fairly poor culture of software at airplane manufacturers in general. Super glad to see work like this coming out of Airbus, really makes me rethink my earlier assumptions about software competence in the field.
5 comments

That is such a bizarre viewpoint from my perspective. The absolute deathtrap that is the 737 MAX had two software-related critical failures in ~400,000 flights. That is a whole-system per-flight software failure rate of 2 in ~400,000, or ~99.9995% reliability: 5 9s. Obviously that is still unacceptable, as it is far below the software standard among commercial airplanes, where software has not been implicated in a crash for at least the last 10 years except on the 737 MAX. Even if we include the two 737 MAX crashes in the statistics, the whole-system per-flight software failure rate of all commercial airplanes over the last decade is at most 2 in ~100,000,000, or ~99.999998% reliability: 7 9s. The standard in airplane software is literally 5000x more reliable than the AWS SLA guarantee (99.99%) and 500x more reliable than the holy grail in server software of 5 9s. Even the 737 MAX is 20x better than the AWS guarantee and 2x more reliable than 5 9s. Airplane software is not bad; we just rightfully expect a lot from systems that lives depend on. Even systems that beat best-in-class non-safety software are completely unacceptable, which may give the impression that they are bad in absolute terms because they fail to live up to our expectations.
That’s an interesting way to look at uptime, no pun intended.
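For anyone who wants to check the arithmetic, here is a rough sketch in Python. The flight counts and the 99.99% SLA figure are the ones quoted in this thread, not independently verified, and `nines` is just a helper name made up for this example.

```python
import math

def nines(failure_rate: float) -> int:
    """Count the whole 9s in a reliability of 1 - failure_rate (99.999% -> 5)."""
    # Small epsilon guards against log10 landing just under an integer.
    return math.floor(-math.log10(failure_rate) + 1e-9)

# Figures as quoted in the comment (assumptions, not verified data).
max_failure_rate = 2 / 400_000          # 737 MAX: 2 failures in ~400,000 flights
fleet_failure_rate = 2 / 100_000_000    # all commercial flights over the decade
aws_sla_failure_rate = 1e-4             # 99.99% uptime guarantee
five_nines_failure_rate = 1e-5          # 99.999%

for label, rate in [("737 MAX", max_failure_rate), ("fleet", fleet_failure_rate)]:
    print(f"{label}: {100 * (1 - rate):.6f}% reliable, {nines(rate)} nines")

# How many times lower each failure rate is than the reference rate.
print("fleet vs AWS SLA:", round(aws_sla_failure_rate / fleet_failure_rate))
print("fleet vs 5 nines:", round(five_nines_failure_rate / fleet_failure_rate))
print("MAX vs AWS SLA:", round(aws_sla_failure_rate / max_failure_rate))
print("MAX vs 5 nines:", round(five_nines_failure_rate / max_failure_rate))
```

Note this compares a per-flight failure probability with a time-based uptime fraction, so it is only as apples-to-apples as the original comparison.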

Though I wouldn’t buy a Toyota that exploded every 400,000 trips worldwide, or bank with a bank that lost all my money every 400,000 transactions worldwide.

Indeed, a Toyota with a critical fatality-inducing safety defect every 200,000 trips would rightfully be viewed as a deathtrap. Given that the average trip is probably somewhere around ~30 miles, that would be a fatality per 6M miles versus the standard of ~60M miles in the US, or about 10x more dangerous. However, when comparing cars with airplanes, given that they both fill the niche of transportation and are to some degree substitutable, a more reasonable analysis would use fatalities/person-hour or fatalities/person-mile. For fatalities/person-hour, the average flight is something like ~2 hours. In the same amount of time, 200,000 cars driving for 2 hours at an average of 40 mph would cover ~16M miles, so the 737 MAX is ~4x more dangerous than cars on a person-hour basis. If we go by distance, the average flight is ~500 miles, so the 737 MAX had a fatality per 100M person-miles, or is ~1.6x safer than driving. That is just how high our standards are with planes: a plane that is viewed as an absolute death machine, totally unfit for use, is still safer than its primary alternative over an equivalent distance. A plane that is 100x worse than any other commercial plane is still better than the non-plane alternative on a per-distance basis.
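The same back-of-envelope numbers as a quick Python sketch. Every constant below is one of the rough guesses from the comment above (trip length, flight length, speeds, the ~60M miles per fatality figure), so treat the outputs as illustrative only.

```python
# All inputs are the rough estimates used in the comment, not measured data.
TRIPS_PER_FATAL_EVENT = 200_000       # one fatal failure per 200,000 trips/flights
CAR_TRIP_MILES = 30                   # guessed average car trip
US_MILES_PER_FATALITY = 60_000_000    # guessed US driving baseline
FLIGHT_HOURS = 2                      # guessed average flight duration
CAR_MPH = 40                          # guessed average driving speed
FLIGHT_MILES = 500                    # guessed average flight distance

# Hypothetical Toyota failing once per 200,000 trips.
toyota_miles_per_fatality = TRIPS_PER_FATAL_EVENT * CAR_TRIP_MILES
toyota_vs_baseline = US_MILES_PER_FATALITY / toyota_miles_per_fatality

# Per-hour view: miles that 200,000 cars cover in one flight's duration.
car_miles_in_flight_time = TRIPS_PER_FATAL_EVENT * FLIGHT_HOURS * CAR_MPH
max_vs_car_per_hour = US_MILES_PER_FATALITY / car_miles_in_flight_time

# Per-mile view: miles flown per fatal event versus the driving baseline.
max_miles_per_fatality = TRIPS_PER_FATAL_EVENT * FLIGHT_MILES
max_vs_car_per_mile = max_miles_per_fatality / US_MILES_PER_FATALITY

print(f"Toyota: 1 per {toyota_miles_per_fatality:,} miles, "
      f"{toyota_vs_baseline:.0f}x worse than baseline")
print(f"737 MAX per hour: {max_vs_car_per_hour:.2f}x more dangerous than driving")
print(f"737 MAX per mile: {max_vs_car_per_mile:.2f}x safer than driving")
```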

Obviously, this does not excuse their actions as they still made a system at least 100x more dangerous than the standard, but it should give perspective on the difficulty of the problems actually being solved. It is not a bunch of amateurs or below-average engineers who need to adopt basic practices. It is a bunch of highly-skilled professionals developing systems with a level of reliability far beyond what most software developers even think is possible. Even the abysmal processes of the 737 MAX that are far below the standard in the airplane industry would, relative to most software, be very good. It is just that the problems they need to solve are very, very, very hard and very good does not cut it when lives, not data, are at stake.

Well, Toyota had the sticking gas pedal issue 10 years ago: they did not implement a brake override for when the gas pedal was stuck. European manufacturers recommended this feature when they introduced the electronic throttle; apparently Toyota didn't get the memo.

Although I find the GM ignition key issue way worse than Toyota's, which was an oversight.

Apples to oranges? The scale between AWS and 737s is several orders of magnitude different. Boeing has a critical issue every 200k flights, or let's say 3.8M hours of flight time (assuming all flights are 19h, which they are not). Assume AWS has 1M CPUs total (they have way more than that). If AWS saw a critical CPU bug every 3.8M hours of CPU time, they would be having a 737 MAX-level crisis every 3.8 hours.
One failure per 3.8M hours would be once per 433 CPU-years, so they probably actually do have somewhere between 10-100x that failure rate for their CPUs given that expected CPU lifetime is probably around 20-30 years. Even using a much more reasonable 2 hours per flight that is still ~45 CPU-years so still within the likely range of expected CPU errors. Also that is a comparison against a system so dangerous that it is unfit for use instead of the actual standard which is once per 50,000,000 flights or ~250x better.
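A quick Python sketch of the unit conversions in this exchange. The 1M CPU count, the 19h and 2h flight durations, and the 50,000,000-flight standard are the assumed figures from the two comments, nothing more.

```python
HOURS_PER_YEAR = 24 * 365                 # 8,760, ignoring leap years

FLIGHTS_PER_FAILURE = 200_000             # 737 MAX figure used in the thread
CPU_COUNT = 1_000_000                     # assumed AWS fleet size from the parent
STANDARD_FLIGHTS_PER_FAILURE = 50_000_000 # industry standard quoted above

def cpu_years_between_failures(hours_per_flight: float) -> float:
    """Convert 'one failure per N flights' into CPU-years between failures."""
    return FLIGHTS_PER_FAILURE * hours_per_flight / HOURS_PER_YEAR

hours_at_19h = FLIGHTS_PER_FAILURE * 19            # 3.8M operating hours
fleet_hours_between_failures = hours_at_19h / CPU_COUNT

print(f"fleet-wide: one failure every {fleet_hours_between_failures} hours")
print(f"19h flights: {cpu_years_between_failures(19):.1f} CPU-years per failure")
print(f"2h flights:  {cpu_years_between_failures(2):.1f} CPU-years per failure")
print(f"standard vs MAX: {STANDARD_FLIGHTS_PER_FAILURE / FLIGHTS_PER_FAILURE:.0f}x")
```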

Even ignoring that, I am discussing the uptime of a system using AWS, which only guarantees 99.99% uptime for AWS service in any given AWS region and only owes a 10% refund (which is less than their profit margin) as long as they keep your system up more than 99% of the time. Downtime for a system due to AWS downtime in a region constitutes a critical failure of AWS to deliver expected service. That their lack of service does not result in deaths, unlike an airplane's, is immaterial to a reliability analysis; it only tells us whether their critical failures matter and what level of reliability we should require/demand when making reliability-cost tradeoffs. In other words, the probability and costs of failure are not actually related. It is just that costly failures result in more effort being spent on developing mitigations. In the case of airplanes, critical failure in the form of a crash is very costly, so they take great pains to minimize the whole-system risk of that failure mode.

You seem to be taking the entire industry down by painting in broad strokes from one incident, yet somehow planes aren't crashing every day. Anyway, I don't work in the field, but from what I've read, the issues with the 737 MAX were not software related; they were and are design related. They need redundant sensors. Their overall design approach came from a desire not to put pilots through additional training, and not having redundant sensors or a standard disagree alert were criminally negligent decisions in my opinion. Those are also largely system design decisions, not ones made by software engineers.

Here's a quick, high-level rundown:

https://jalopnik.com/heres-everything-boeing-did-to-fix-the-...

"In practice, the MCAS system accepted readings from only a single angle of attack (AOA) sensor. In the event of a bad sensor reading, the MCAS initiated repeated nose-down inputs. The cockpit alarm for AOA disagreement was also an expensive upcharge.

So Boeing made some changes to the MAX and the MCAS system. The MCAS system now has a maximum limit of one nose-down input during a single event of high angle of attack. The limit doesn’t reset if the pilots activate the electric trim switches. Further, an AOA sensor monitor was added to make sure MCAS doesn’t use AOA input if sensors disagree with each other by more than 5.5 degrees. The Flight Control Computer itself also no longer relies on a single sensor. Another important change is with the AOA DISAGREE alert. Previously, this alert was part of an optional AOA Gauge offered by Boeing. Now the AOA DISAGREE alert is always enabled, regardless of whether the airline has the option or not. All these changes are in the FAA summary."
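To make the quoted changes concrete, here is a toy Python sketch of the two rules described: ignore AOA input when the sensors disagree by more than 5.5 degrees, and allow at most one nose-down input per high-AOA event. This is emphatically not Boeing's logic. The class name, the 15-degree trigger, using the lower of the two readings, and resetting once AOA drops back below the trigger are all assumptions made up for illustration.

```python
from dataclasses import dataclass

AOA_DISAGREE_LIMIT_DEG = 5.5   # disagreement threshold quoted in the summary
HIGH_AOA_TRIGGER_DEG = 15.0    # hypothetical value, not from any source

@dataclass
class McasSketch:
    """Toy model of the described rules; not the real flight control law."""
    activated_this_event: bool = False

    def should_command_nose_down(self, aoa_left_deg: float, aoa_right_deg: float) -> bool:
        # Rule 1: do not act on AOA data when the two sensors disagree.
        if abs(aoa_left_deg - aoa_right_deg) > AOA_DISAGREE_LIMIT_DEG:
            return False
        # Assumption: the event ends once AOA falls back below the trigger.
        if min(aoa_left_deg, aoa_right_deg) < HIGH_AOA_TRIGGER_DEG:
            self.activated_this_event = False
            return False
        # Rule 2: at most one nose-down input per high-AOA event.
        if self.activated_this_event:
            return False
        self.activated_this_event = True
        return True

mcas = McasSketch()
print(mcas.should_command_nose_down(20.0, 21.0))  # sensors agree, first input
print(mcas.should_command_nose_down(20.0, 21.0))  # same event, second request
print(mcas.should_command_nose_down(20.0, 30.0))  # sensors disagree
```

The original design, as described in the quote, effectively had neither check: one sensor, repeated inputs.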

More detail on the flaws in a NYTimes article:

https://www.nytimes.com/interactive/2019/03/29/business/boei...

Airbus also has the Airbus Defense and Space group; it’s not all just airplanes :)
Is "move fast and break things" a good culture for an airplane manufacturer? Airbus is known for making good software. They earned their reputation with the first digital fly-by-wire airliner (the A320, launched in '84 and in service from 1988), which forced Boeing to go this route with the 777.

Making safety-critical software is a totally different world from what is seen on HN. The culture needed is a safety culture, and it is all about writing boring code, following strict coding rules, doing tons of documentation and analysis prior to coding, and doing tons of review of tests. I don't think it will arouse interest here.

Airbus is known to be excellent in airplane software development.

However, this is probably not about the airplane part of Airbus. Like Boeing, Airbus also have huge defense and space divisions.