
No, not really. How would a normal software engineer evaluate the processes even if they were stated? There is no frame of reference for what is effective unless you trace the processes to quantitative outcomes. Suppose I said: "The industry uses an autoregressive failure model with 175 billion parameters, 10x more than any previous non-sparse failure model." Would that mean anything? It does not; I just replaced "language" with "failure" in the GPT-3 abstract. How can anybody tell an effective process from an ineffective one without tracing it to an actual outcome? 10x as many tests and as much code mean nothing if they test nothing of value. Redundancies are irrelevant if their failures are completely correlated. Regulations mean nothing if they encode ineffective or meaningless techniques (look at security standards that require antivirus software). One of the only ways to compare processes without being tricked by fancy words, especially as a non-expert, is to look at and compare actual outcomes.

I partly agree that the metric I chose is sloppy, but you can afford to be sloppy when you are comparing things with such disparate outcomes. Sure, maybe we are not comparing a 1-story house to a 50-story skyscraper, only to a 30-story one, but that has little impact on the fact that they are fundamentally different, and to declare them even remotely comparable is a massive category error.

I do, however, disagree that "uptime" is a nonsense metric, though there are absolutely better ones. "Uptime" in this context means the duration or probability of critical operational failure, which is an extremely relevant metric. That a critical operational failure at AWS does not result in fatalities has no bearing on whether the failure occurred; it just means the failure matters less. A valid quibble is that I am using crashes as a proxy for failure, which discounts critical software failures that did not cause a critical operational failure thanks to non-software redundancy. But again, the outcomes are so disparate that it beggars belief that this would bridge the gap.

As for aircraft computers being rebooted frequently: true. So? I am comparing full-system reliability during operation, not the reliability of individual components. It is not as if individual AWS servers run indefinitely either; they are rebooted frequently, but the system as a whole stays operational thanks to redundancy and migration.
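To make the component-versus-system distinction concrete, here is a minimal sketch of how redundancy changes system availability. The function name, the 0.99 per-replica figure, and the two idealized cases (failures fully independent or fully correlated) are my own illustrative assumptions, not numbers from AWS or any aircraft.

```python
def system_availability(replica_availability: float, replicas: int, correlated: bool) -> float:
    """Probability that at least one replica is up at a given moment.

    Hypothetical toy model: every replica has the same availability, and
    failures are either fully independent or fully correlated.
    """
    if correlated:
        # Fully correlated: all replicas fail together, so extra copies add nothing.
        return replica_availability
    # Independent: the system is down only if every replica is down at once.
    return 1.0 - (1.0 - replica_availability) ** replicas


if __name__ == "__main__":
    a = 0.99  # assumed per-replica availability, illustrative only
    for n in (1, 2, 3):
        independent = system_availability(a, n, correlated=False)
        lockstep = system_availability(a, n, correlated=True)
        print(f"replicas={n} independent={independent:.6f} correlated={lockstep:.6f}")
```

In the independent case each extra replica multiplies the downtime probability by another factor of 0.01, while in the fully correlated case three replicas are no better than one. Real systems sit somewhere in between, which is one more reason to check the measured outcome rather than the stated design.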

The reliability estimate does account for the bug. The bug did not cause a critical operational failure. It could have caused one only in an extremely unlikely case, and only if it had remained undetected with no measures taken to avoid or correct for it. However, it was detected and countermeasures have been put in place, so the processes continue to achieve their intended goal of preventing critical operational failure. The outcome-based estimate therefore remains accurate.

Just to be clear, an outcome-based estimate is not perfect. By its nature it only looks at the past, so it has no true predictive power, and you cannot use it to predict the effects of future process changes. However, it is a relatively unbiased way of evaluating whether prior processes were effective, which tells us which past processes, and which past process changes, actually worked.