Hacker News
by Veserv 2210 days ago
You are correct, but airplane companies already do that for the most part, and much, much more.

The difference in reliability between normal software and airplane software is so vast that "best practices" from normal software cannot be applied to airplane software, since that would be gross criminal negligence. To explain, in the 10 years prior to the 737 MAX problems there were 50,000,000 flights and software was not implicated in a single passenger air fatality. The average flight is ~5,000 km, which is ~4-5 hours. So, once the two MAX crashes are counted, there were two crashes due to software in ~250,000,000 flight-hours. A plane takes ~3 minutes to fall from cruising altitude, so we can model this as a downtime of 6 minutes (0.1 hours) per 250,000,000 hours, which gives us a downtime of 1 in 2,500,000,000, or 99.99999996% uptime (yes, that is 9 9s). In contrast, I think most software people would agree that AWS is high quality. The AWS SLA specifies 99.99% uptime (1 in 10,000 downtime). So, by this metric, airplane software is 250,000x more reliable than normal high quality software.
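
As a rough sketch, the arithmetic above can be reproduced in a few lines of Python. Every input is one of the estimates quoted in the comment (not measured data), and the variable names are made up for illustration.

```python
# Rough check of the uptime arithmetic; every input is an estimate from the comment.
FLIGHTS = 50_000_000          # flights over the 10-year window
HOURS_PER_FLIGHT = 5          # ~5,000 km average flight
CRASHES = 2                   # crashes attributed to software
FALL_MINUTES = 3              # assumed time to fall from cruising altitude
AWS_DOWNTIME_ONE_IN = 10_000  # 99.99% SLA, i.e. downtime of 1 part in 10,000

flight_minutes = FLIGHTS * HOURS_PER_FLIGHT * 60
downtime_minutes = CRASHES * FALL_MINUTES

# Downtime expressed as "1 part in N" of total operating time.
one_in = flight_minutes // downtime_minutes
uptime_percent = 100 * (1 - downtime_minutes / flight_minutes)
ratio_vs_aws = one_in // AWS_DOWNTIME_ONE_IN

print(f"flight-hours: {flight_minutes // 60:,}")
print(f"downtime:     1 in {one_in:,}")
print(f"uptime:       {uptime_percent:.8f}%")
print(f"ratio vs AWS: {ratio_vs_aws:,}x")
```

Working in whole minutes keeps the "1 in N" figures as exact integers, so only the uptime percentage goes through floating point.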

The point of this is that the standard for airplanes is almost inconceivably high compared to normal software. To think that they are incompetent or suggest that all they need to do is adopt X or Y common-sense/best-practice is a gross misunderstanding of what is being done and what needs to be done to improve. It would be like someone trying to tell a civil engineer making a 50-story skyscraper that they really need to adopt high quality wood construction techniques from makers of doghouses. To actually improve it, you need to consider practices 250,000x better than "best practices" and go from there.

To put it another way, the solutions are actually really really good, unfortunately the problems are really really really really hard.

2 comments

Not to detract from your point that aeronautical industry software is reliable (it is), but the 737 MAXes that crashed were all new planes. There weren't even 24 months between the first delivery of a MAX and the model being grounded.

The issues with the MAX were also clearly preventable and there were multiple failures of the systems (regulators, internal reviews, etc.) that were in place to catch these kinds of issues.

But as you point out, the aeronautical industry has an excellent track record for software reliability, if you evaluate reliability by hull losses. By other metrics, it's a bit more debatable (e.g. the integer overflow in Dreamliners that means they need to be restarted at least every 248 days), but it still keeps people moving safely.

Yes. I included the MAX because otherwise the number of software-related fatalities over the last 10 years is 0. If you do just the MAX, the low end in terms of flights is ~200,000 with an average of 3 hours per flight, or ~600,000 flight-hours. Using the same time basis as above (6 minutes of downtime), that is 1 in 6,000,000 or 99.99998% uptime, which is 600x better than AWS by my previously used metric. The software of an unconscionable deathtrap is 600x better than extremely high quality server software.
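
The MAX-only figures can be checked the same way. This is a minimal sketch using the estimates from the comment; `one_in_downtime` is a hypothetical helper name, and the 3-minute fall time is the assumption carried over from the original estimate.

```python
def one_in_downtime(flights, hours_per_flight, crashes, fall_minutes=3):
    """Return N such that modelled downtime is 1 part in N of operating time."""
    flight_minutes = flights * hours_per_flight * 60
    downtime_minutes = crashes * fall_minutes
    return flight_minutes / downtime_minutes

AWS_DOWNTIME_ONE_IN = 10_000  # 99.99% SLA, i.e. downtime of 1 part in 10,000

# Low-end estimate for the 737 MAX alone: ~200,000 flights of ~3 hours, 2 crashes.
max_one_in = one_in_downtime(flights=200_000, hours_per_flight=3, crashes=2)
max_uptime_percent = 100 * (1 - 1 / max_one_in)
max_ratio_vs_aws = max_one_in / AWS_DOWNTIME_ONE_IN

print(f"downtime:     1 in {max_one_in:,.0f}")
print(f"uptime:       {max_uptime_percent:.5f}%")
print(f"ratio vs AWS: {max_ratio_vs_aws:,.0f}x")
```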

My primary point is that many people look at these failures and incorrectly conclude that the processes in place are objectively terrible and below average. This leads them to discount those processes in favor of policies from vastly less reliable systems that they consider quality-focused or "best practices". They, fairly, assume that "bad" in a safety-critical context means the same as regular "bad", so regular "amazing" must clearly be better. In truth, "unconscionable deathtrap" and "gross criminal negligence" in the airplane world are closer to synonyms for "amazing beyond belief" in the rest of the software industry. The correct takeaway is that regular "amazing" is actually orders of magnitude worse than "unconscionable deathtrap" and is thus completely inadequate for the job. As a corollary, if you do not think you are doing "way better than amazing", you are probably not doing an adequate job in these contexts.

To reiterate, the solutions are really really good, unfortunately the problems are really really really really hard.

I do totally agree with your larger points, but these numbers just don’t make any sense, and analysis like this could do unintended damage to your otherwise good points. Would it perhaps be better to cite the industry testing practices and procedures, the volume of testing, the regulations, training, feedback loop, redundancies, and all the other safety efforts behind airline software?

Uptime is not a comparable metric in any way. Aircraft computers often reboot every flight or every day. AWS downtimes don’t typically result in fatalities. The fall time of the 737 MAX before it impacts isn’t ‘downtime’, and simply cannot be used to summarize the reliability of aviation software as a whole. Arriving at 250,000x this way makes it a meaningless number, and you didn’t account for the bug in the linked article in your reliability estimate at all.

No, not really. How would a normal software engineer evaluate the processes if stated? There is no frame of reference for what is effective or not if you do not trace to quantitative outcomes. Like, if I said: "The industry uses an autoregressive failure model with 175 billion parameters, 10x more than any previous non-sparse failure model", would that mean anything? (It does not; I just replaced "language" with "failure" in the GPT-3 abstract.) How can anybody tell what is an effective or ineffective process if they do not trace to an actual outcome? 10x as many tests and code mean nothing if they test nothing of value. Redundancies are irrelevant if they are completely correlated. Regulations mean nothing if they encode ineffective or meaningless techniques (look at security standards which require antiviruses). One of the only ways to compare processes and not be tricked by fancy words, especially as a non-expert, is to look at and compare actual outcomes.
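
To make the point about correlated redundancy concrete, here is a toy calculation. It assumes a made-up per-unit failure probability and looks only at the two extremes (fully independent units versus fully correlated units); real systems sit somewhere in between, and the function name is hypothetical.

```python
def system_failure_probability(p, n, fully_correlated):
    """Probability that all n redundant units fail at the same time.

    p: failure probability of a single unit (hypothetical value)
    n: number of redundant units
    fully_correlated: if True, every unit fails whenever one of them does
    """
    if fully_correlated:
        # Extra units add nothing: they all share the same failure.
        return p
    # Independent units must each fail on their own.
    return p ** n

P_UNIT = 1e-3  # illustrative per-unit failure probability, not a real figure

for n in (1, 2, 3):
    independent = system_failure_probability(P_UNIT, n, fully_correlated=False)
    correlated = system_failure_probability(P_UNIT, n, fully_correlated=True)
    print(f"n={n}: independent={independent:.0e}  fully correlated={correlated:.0e}")
```

Counting the redundant units says nothing about which of the two columns a real system is closer to, which is why the process description alone does not tell you much.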

I agree that the metric I chose is somewhat sloppy, but you can afford to be sloppy when you are comparing things with such disparate outcomes. Sure, maybe we are not comparing a 1-story house to a 50-story skyscraper and it is only a 30-story skyscraper, but that has little impact on the fact that they are fundamentally different, and to declare that they are even remotely comparable is a massive category error.
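
One way to see how much sloppiness the comparison tolerates is to deliberately make the inputs worse and watch the ratio. The sketch below starts from the 250,000,000 flight-hours and 6 minutes of modelled downtime used earlier in the thread; the pessimism factors are arbitrary values picked for illustration, not estimates of the real error.

```python
from itertools import product

AWS_DOWNTIME_ONE_IN = 10_000  # 99.99% SLA, i.e. downtime of 1 part in 10,000

def ratio_vs_aws(flight_hours, downtime_minutes):
    """How many times smaller the modelled downtime fraction is than AWS's SLA."""
    one_in = flight_hours * 60 / downtime_minutes
    return one_in / AWS_DOWNTIME_ONE_IN

BASE_FLIGHT_HOURS = 250_000_000  # estimate from the thread
BASE_DOWNTIME_MINUTES = 6        # two crashes at ~3 minutes each

# Hypothetical pessimism factors: fewer flight-hours, more "downtime" per failure.
HOURS_FACTORS = (1, 0.5, 0.1)
DOWNTIME_FACTORS = (1, 10, 100)

for hours_factor, downtime_factor in product(HOURS_FACTORS, DOWNTIME_FACTORS):
    ratio = ratio_vs_aws(BASE_FLIGHT_HOURS * hours_factor,
                         BASE_DOWNTIME_MINUTES * downtime_factor)
    print(f"flight-hours x{hours_factor:<4} downtime x{downtime_factor:<4} "
          f"-> {ratio:,.0f}x")
```

Under these made-up factors, the combined worst case (a tenth of the flight-hours and a hundred times the downtime) still comes out in the hundreds rather than anywhere near 1x.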

I, however, disagree that "uptime" is a nonsense metric, though there are absolutely better ones. "Uptime" in this context means duration/probability of critical operational failure which is an extremely relevant metric. That AWS does not result in fatalities during critical operational failure has no bearing on whether critical operational failure occurred or not, it just means that it matters less. A valid quibble is that I am using crashes as a proxy for failure which discounts critical software failures that did not cause critical operational failure due to non-software redundancy, but again, the outcomes are so disparate it beggars belief that this would bridge the gap.

As for aircraft computers being rebooted frequently, true. So? I am comparing full system reliability during operation, not individual components. It is not like individual AWS servers run indefinitely; they are rebooted frequently, but the system as a whole stays operational due to redundancy and migration.

The reliability estimate does account for the bug. The bug did not cause a critical operational failure. It could cause a critical operational failure in an extremely unlikely case if it remained undetected and no measures were taken to avoid or correct for it. However, it was detected and countermeasures have been put into place, so the processes in place continue to achieve their intended goal of preventing critical operational failure. So, the outcome-based estimate continues to be accurate.

Just to be clear, an outcome-based estimate is not perfect. By its nature, it only looks at the past, so it has no true predictive power. You cannot use an outcome-based estimate to predict the effects of future process changes. However, it is a relatively unbiased way of evaluating whether prior processes were effective, which tells us which past processes, and which past process changes, actually worked.