by Veserv 1883 days ago
That is such a bizarre viewpoint from my perspective. The absolute deathtrap that is the 737 MAX had two software-related critical failures in ~400,000 flights. That is a whole-system per-flight software reliability of 2 failures in ~400,000, or ~99.9995%, i.e. 5 9s. Obviously that is still unacceptable, as it is far below the software standard among all commercial airplanes, where software has not been implicated in a crash for at least the last 10 years except on the 737 MAX. Even if we include the two 737 MAX crashes in the statistics, the whole-system per-flight software reliability of all commercial airplanes over the last decade is at least ~99.999998% (at most 2 failures in ~100,000,000 flights), or 7 9s. The standard in airplane software is literally 5000x more reliable than the AWS SLA guarantee of 99.99% and 500x more reliable than 5 9s, the holy grail in server software. Even the 737 MAX is 20x better than the AWS guarantee and 2x more reliable than 5 9s. Airplane software is not bad; we just rightfully expect a lot from systems that lives depend on. Even systems that are better than best-in-class non-safety software are completely unacceptable, which may give the impression that they are bad in absolute terms because they fail to live up to our expectations.
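For anyone who wants to check the ratios, here is a minimal sketch of the arithmetic. The inputs are the rough figures from the comment (2 failures in ~400,000 flights, 2 in ~100,000,000, a 99.99% SLA); the function and variable names are only illustrative.

```python
import math

def nines(failure_rate):
    """Express a failure rate as a count of nines, e.g. 1e-5 -> 5.0."""
    return -math.log10(failure_rate)

# Per-flight failure rates, using the comment's rough figures.
max_rate = 2 / 400_000          # 737 MAX: 5e-06
fleet_rate = 2 / 100_000_000    # all commercial airplanes: 2e-08
aws_sla_rate = 1e-4             # a 99.99% uptime guarantee
five_nines_rate = 1e-5          # 99.999%

print(f"737 MAX: {nines(max_rate):.1f} nines")    # about 5.3
print(f"fleet:   {nines(fleet_rate):.1f} nines")  # about 7.7

# How many times lower is the failure rate?
print(aws_sla_rate / fleet_rate)      # ~5000
print(five_nines_rate / fleet_rate)   # ~500
print(aws_sla_rate / max_rate)        # ~20
print(five_nines_rate / max_rate)     # ~2
```

Note that this compares a per-flight failure probability against a fraction-of-time uptime figure, which is the same comparison the comment makes.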

That’s an interesting way to look at uptime, no pun intended.

Though I wouldn’t buy a Toyota that exploded every 400,000 trips worldwide, or bank with a bank that lost all my money every 400,000 transactions worldwide.

Indeed, a Toyota with a critical fatality-inducing safety defect every 200,000 trips (2 in 400,000) would be rightfully viewed as a deathtrap. Given that the average trip is probably somewhere around ~30 miles, that would be a fatality per 6M miles versus the standard of ~60M miles in the US, or about 10x more dangerous. However, when comparing cars against airplanes, given that they both fulfill the niche of transportation and are to some degree substitutable, a more reasonable analysis would be fatalities/person-hour or fatalities/person-mile. For fatalities/person-hour, the average flight is something like ~2 hours. In the same amount of time, 200,000 cars driving for 2 hours at an average of 40 mph would cover ~16M miles, versus ~60M miles per driving fatality, so the 737 MAX is ~4x more dangerous than cars on a person-hour basis. If we go by distance, the average flight is ~500 miles, so the 737 MAX had a fatality per 100M person-miles, or is ~1.6x safer than driving. That is just how high our standards for planes are: a plane that is viewed as an absolute death machine, totally unfit for use, is still safer than its primary alternative over an equivalent distance. A plane that is 100x worse than any other commercial plane is still better than the non-plane alternative on a per-distance basis.
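The comparison above can be reproduced with a short sketch. Every input is one of the comment's own rough guesses (30-mile trips, 2-hour and 500-mile flights, 40 mph, ~60M miles per driving fatality), not measured data, and the variable names are made up for illustration.

```python
TRIPS_PER_FATAL_EVENT = 200_000      # one fatal event per 200,000 trips/flights
US_MILES_PER_FATALITY = 60_000_000   # the comment's rough US driving figure

# Hypothetical Toyota with the same per-trip defect rate, 30-mile average trip.
toyota_miles_per_fatality = TRIPS_PER_FATAL_EVENT * 30                       # 6,000,000
toyota_vs_driving = US_MILES_PER_FATALITY / toyota_miles_per_fatality        # 10.0x worse

# Per-hour basis: miles 200,000 cars cover in one 2-hour flight's time at 40 mph.
car_miles_in_same_hours = TRIPS_PER_FATAL_EVENT * 2 * 40                     # 16,000,000
max_vs_driving_per_hour = US_MILES_PER_FATALITY / car_miles_in_same_hours    # 3.75x worse

# Per-mile basis: 500-mile average flight.
max_miles_per_fatality = TRIPS_PER_FATAL_EVENT * 500                         # 100,000,000
max_vs_driving_per_mile = max_miles_per_fatality / US_MILES_PER_FATALITY     # ~1.67x safer

print(toyota_vs_driving, max_vs_driving_per_hour, round(max_vs_driving_per_mile, 2))
```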

Obviously, this does not excuse their actions as they still made a system at least 100x more dangerous than the standard, but it should give perspective on the difficulty of the problems actually being solved. It is not a bunch of amateurs or below-average engineers who need to adopt basic practices. It is a bunch of highly-skilled professionals developing systems with a level of reliability far beyond what most software developers even think is possible. Even the abysmal processes of the 737 MAX that are far below the standard in the airplane industry would, relative to most software, be very good. It is just that the problems they need to solve are very, very, very hard and very good does not cut it when lives, not data, are at stake.

Well, Toyota had the sticking gas pedal issue 10 years ago: they did not implement a brake override for when the gas pedal was stuck. This was a feature recommended by European manufacturers when they introduced the electronic throttle; apparently Toyota didn't get the memo.

Although I find the GM ignition key issue way worse than Toyota's, which was an oversight.

Apples to oranges? The scale between AWS and 737s is several orders of magnitude different. Boeing has a critical issue every 200k flights, or let's say 3.8M hours of flight time (assuming all flights are 19h, which they are not). Assume AWS has 1M CPUs total (they have way more than that). If AWS saw a critical CPU bug every 3.8M hours of CPU time, they would be having a 737 MAX-level crisis every 3.8 hours.

One failure per 3.8M hours would be once per ~433 CPU-years, so they probably do have a failure rate somewhere between 10x and 100x higher than that for their CPUs, given that expected CPU lifetime is probably around 20-30 years. Even using a much more reasonable 2 hours per flight, that is still ~45 CPU-years, so still within the likely range of expected CPU errors. Also, that is a comparison against a system so dangerous that it is unfit for use, instead of the actual standard, which is once per 50,000,000 flights, or ~250x better.
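A quick sketch of the CPU-year conversion, using the thread's own assumptions (one failure per 200,000 flights, 19 h or 2 h per flight, a hypothetical fleet of 1M CPUs). The helper name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def cpu_years_between_failures(flights_per_failure, hours_per_flight):
    """Treat each flight-hour as one CPU-hour and convert to CPU-years."""
    return flights_per_failure * hours_per_flight / HOURS_PER_YEAR

cpu_years_19h = cpu_years_between_failures(200_000, 19)  # ~433.8 CPU-years
cpu_years_2h = cpu_years_between_failures(200_000, 2)    # ~45.7 CPU-years

# With 1M CPUs running in parallel, 3.8M CPU-hours elapse every 3.8 wall-clock hours.
fleet_hours_between_failures = 200_000 * 19 / 1_000_000

# The fleet-wide standard (one per 50,000,000 flights) versus the 737 MAX rate.
standard_vs_max = 50_000_000 / 200_000                   # 250.0

print(round(cpu_years_19h), round(cpu_years_2h, 1), fleet_hours_between_failures, standard_vs_max)
```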

Even ignoring that, I am discussing the uptime of a system using AWS, which only guarantees 99.99% uptime for AWS service in any given AWS region and only gives a 10% refund (which is less than their profit margin) as long as they keep your system up more than 99% of the time. Downtime for a system due to AWS downtime in a region constitutes a critical failure of AWS to deliver expected service. That their lack of service does not result in deaths, unlike an airplane's, is immaterial to a reliability analysis; it only tells us whether their critical failures matter and what level of reliability we should require/demand when making reliability-cost tradeoffs. In other words, the probability and costs of failure are not actually related. It is just that costly failures result in more effort being spent on developing mitigations. In the case of airplanes, critical failure in the form of a crash is very costly, so they take great pains to minimize the whole-system risk of that failure mode.
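To put those uptime percentages in concrete terms, here is a rough sketch that converts them to allowed downtime. It assumes a 30-day billing month, which is my assumption and not something stated above; the function name is illustrative.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    return (100 - uptime_percent) / 100 * days * 24 * 60

sla_minutes = allowed_downtime_minutes(99.99)       # ~4.32 minutes per 30 days
refund_floor_minutes = allowed_downtime_minutes(99)  # 432 minutes, i.e. 7.2 hours

print(round(sla_minutes, 2), refund_floor_minutes / 60)
```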