|
|
|
|
|
by dahart
2209 days ago
|
|
I do totally agree with your larger points, but these numbers just don’t make any sense, and analysis like this could do unintended damage to your otherwise good points. Would it perhaps be better to cite the industry testing practices and procedures, the volume of testing, the regulations, training, feedback loop, redundancies, and all the other safety efforts behind airline software? Uptime is not a comparable metric in any way. Aircraft computers often reboot every flight or every day. AWS downtimes don’t typically result in fatalities. The fall time of the 737 MAX before it impacts isn’t ‘downtime’, and simply cannot be used to summarize the reliability of aviation software as a whole. Arriving at 250000x this way makes it a meaningless number, and you didn’t account for the bug in the linked article in your reliability estimate at all. |
|
I somewhat agree that the metric I chose is somewhat sloppy, but you can afford to be sloppy when you are comparing things with such disparate outcomes. Sure, maybe we are not comparing a 1 story house to a 50 story skyscraper, it is only a 30 story skyscraper, but that has little impact on the fact that they are fundamentally different and to declare that they are even remotely comparable is a massive category error.
I, however, disagree that "uptime" is a nonsense metric, though there are absolutely better ones. "Uptime" in this context means duration/probability of critical operational failure which is an extremely relevant metric. That AWS does not result in fatalities during critical operational failure has no bearing on whether critical operational failure occurred or not, it just means that it matters less. A valid quibble is that I am using crashes as a proxy for failure which discounts critical software failures that did not cause critical operational failure due to non-software redundancy, but again, the outcomes are so disparate it beggars belief that this would bridge the gap.
As for aircraft computers being rebooted frequently, true. So? I am comparing full system reliability during operation, not individual components. It is not like individual AWS servers run indefinitely; they are rebooted frequently, but the system as a whole stays operational due to redundancy and migration.
The reliability estimate does account for the bug. The bug did not cause a critical operational failure. It could cause a critical operational failure in an extremely unlikely case if it remained undetected and no measures were taken to avoid or correct for it. However, it was detected and countermeasures have been put into place, so the processes in place continue to achieve their intended goal of preventing critical operational failure. So, the outcome-based estimate continues to be accurate.
Just to be clear, an outcome-based estimate is not perfect. By its nature, it only looks at the past, so has no true predictive power. You can not use an outcome-based estimate to predict the effects of process changes. However, it is a relatively unbiased way of evaluating if prior processes were effective which we can use to inform us which processes of the past were actually effective or not and the effects of process changes.