| HN Mirror

One failure per 3.8M hours works out to once per ~433 CPU-years, so their CPUs probably do fail at somewhere between 10x and 100x that rate, given that expected CPU lifetime is probably around 20-30 years. Even using a much more reasonable 2 hours per flight, that is still ~45 CPU-years per failure, which is still within the likely range of expected CPU errors. Also, that comparison is against a system so dangerous that it is unfit for use, not against the actual standard, which is one failure per 50,000,000 flights, or ~250x better.
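For anyone who wants to check the arithmetic, here is a rough sketch of the conversions. The 200,000-flight baseline is not stated directly; it is inferred from "50,000,000 flights or ~250x better". Reading the 20-30 year CPU lifetime as "about one failure per lifetime" is also my assumption, not a measured figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def hours_to_cpu_years(hours: float) -> float:
    """Convert continuous operating hours of one CPU into CPU-years."""
    return hours / HOURS_PER_YEAR


# One failure per 3.8M operating hours.
years_per_failure = hours_to_cpu_years(3.8e6)  # ~433.8 CPU-years

# Inferred baseline: the 50,000,000-flight standard is "~250x better",
# which implies roughly one failure per 200,000 flights.
baseline_flights = 50_000_000 / 250
implied_hours_per_flight = 3.8e6 / baseline_flights  # 19 hours per flight

# Same baseline with a more typical 2-hour flight.
years_per_failure_2h = hours_to_cpu_years(baseline_flights * 2)  # ~45.7 CPU-years

print(f"3.8M hours        = {years_per_failure:.1f} CPU-years per failure")
print(f"implied flight    = {implied_hours_per_flight:.0f} hours")
print(f"2-hour flights    = {years_per_failure_2h:.1f} CPU-years per failure")

# Assumption: a CPU sees roughly one failure over a 20-30 year lifetime.
for lifetime_years in (20, 30):
    ratio = years_per_failure / lifetime_years
    print(f"lifetime {lifetime_years}y -> ~{ratio:.0f}x the 1-per-3.8M-hours rate")
```

Under those assumptions the ratio comes out at roughly 14-22x, which sits inside the 10-100x range.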

Even ignoring that, I am discussing the uptime of a system built on AWS, which only guarantees 99.99% uptime for AWS service in any given AWS region, and only pays a 10% refund (which is less than their profit margin) as long as they keep your system up more than 99% of the time. Downtime for a system due to AWS downtime in a region constitutes a critical failure of AWS to deliver the expected service. That their lack of service, unlike an airplane crash, does not result in deaths is immaterial to a reliability analysis. It only tells us whether their critical failures matter and what level of reliability we should require/demand when making reliability-cost tradeoffs. In other words, the probability of a failure and the cost of a failure are not actually related. It is just that costly failures result in more effort being spent on developing mitigations. In the case of airplanes, critical failure in the form of a crash is very costly, so they take great pains to minimize the whole-system risk of that failure mode.
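To put the two uptime thresholds in concrete terms, here is a small sketch of how much downtime each one allows. The 30-day month and 365-day year are my assumptions for illustration; the actual SLA measurement period may differ.

```python
def allowed_downtime_hours(uptime_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return (100 - uptime_pct) / 100 * period_hours


MONTH_HOURS = 30 * 24   # assumed 30-day month
YEAR_HOURS = 365 * 24   # assumed non-leap year

for uptime_pct in (99.99, 99.0):
    per_month_min = allowed_downtime_hours(uptime_pct, MONTH_HOURS) * 60
    per_year_hours = allowed_downtime_hours(uptime_pct, YEAR_HOURS)
    print(
        f"{uptime_pct}% uptime: "
        f"{per_month_min:.2f} min/month, {per_year_hours:.2f} h/year"
    )
```

With these assumptions, 99.99% allows about 4.3 minutes of downtime per month, while 99% allows about 7.2 hours per month (roughly 87.6 hours per year). Any outage between those two bounds falls in the 10% refund range.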