|
|
|
|
|
by jread
538 days ago
|
|
> Hardware can fail for all kinds of reasons Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination. |
|
Cloud VMs fail from either the instance itself not coming back online, or an EBS failure, or some other az-wide or region-wide failure that affects networking or control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance, restoring from a snapshot or backup. It's ridiculously easier to recover than dealing with an on-prem hardware failure, and actually reliable, as there's always capacity [I guess barring GPU-heavy instances].
"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.