by elteto
1880 days ago
Apples to oranges? The scales of AWS and the 737 fleet differ by several orders of magnitude. Boeing has a critical issue every 200k flights, or let's say every 3.8M hours of flight time (assuming every flight lasts 19 hours, which overstates it). Assume AWS has 1M CPUs in total (they have far more than that): if AWS saw a critical CPU bug every 3.8M hours of CPU time, it would be facing a 737 MAX-level crisis every 3.8 hours.

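For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The inputs are just the assumed figures from this comment (200k flights per issue, 19 hours per flight, 1M CPUs), not measured data, and the variable names are made up for illustration.

```python
# Back-of-envelope check of the scaling argument.
# All inputs are this comment's own assumed figures, not real fleet data.
FLIGHTS_PER_CRITICAL_ISSUE = 200_000
HOURS_PER_FLIGHT = 19        # deliberately generous upper bound
FLEET_CPUS = 1_000_000       # deliberately low estimate

# Operating hours accumulated between critical issues.
hours_per_issue = FLIGHTS_PER_CRITICAL_ISSUE * HOURS_PER_FLIGHT

# With FLEET_CPUS running in parallel, the fleet accumulates FLEET_CPUS
# CPU-hours per wall-clock hour, so this is the time between issues.
wall_clock_hours_between_issues = hours_per_issue / FLEET_CPUS

print(hours_per_issue)
print(wall_clock_hours_between_issues)
```

The point is only that the same per-unit failure rate turns into a very different failure frequency once you multiply by fleet size.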
Even ignoring that, I am discussing the uptime of a system built on AWS, which only guarantees 99.99% uptime for an AWS service in any given region, and which refunds only 10% (less than their profit margin) when it misses that target, as long as they keep your system up more than 99% of the time. Downtime of a system caused by AWS downtime in a region constitutes a critical failure of AWS to deliver the expected service. That their lack of service does not result in deaths, unlike an airplane's, is immaterial to a reliability analysis; it only tells us how much their critical failures matter and what level of reliability we should demand when making reliability-cost tradeoffs. In other words, the probability of a failure and its cost are not inherently related. It is just that costly failures result in more effort being spent on developing mitigations. In the case of airplanes, a critical failure in the form of a crash is very costly, so they take great pains to minimize the whole-system risk of that failure mode.
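To put those percentages in concrete terms, here is a small sketch that converts an uptime fraction into allowed downtime per year. The 99.99% and 99% thresholds are the figures quoted in this comment; I have not checked them against the current SLA text, and `allowed_downtime_hours` is just a helper name invented for the example.

```python
# Convert an uptime guarantee into a yearly downtime budget.
# Thresholds are the figures quoted in the comment, taken as assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_fraction, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at the given uptime."""
    return (1 - uptime_fraction) * period_hours


sla_target = allowed_downtime_hours(0.9999)   # the guaranteed level
refund_floor = allowed_downtime_hours(0.99)   # the 10%-refund threshold

print(f"99.99% uptime: {sla_target * 60:.1f} minutes of downtime per year")
print(f"99% uptime:    {refund_floor:.1f} hours of downtime per year")
```

Under these assumptions the gap between the two thresholds is roughly a factor of 100 in permitted downtime, which is the range in which only the small refund applies.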