"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== I wish more people realized that this is day-to-day reality if you are in AWS at scale.
Basically, once a system is complex enough, some part of it is always broken. The software must be designed on the assumption that the system is never running flawlessly.
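To make "design for a system that is never running flawlessly" concrete, here is a minimal sketch of one common tactic: retrying transient failures with exponential backoff and jitter. The names `TransientError` and `call_with_retries`, and the default delays, are made up for illustration and not taken from any particular library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def call_with_retries(op, attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run op(); on TransientError, back off exponentially with full jitter and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Cap grows as base_delay * 2^attempt, bounded by max_delay
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, cap) so clients do not retry in lockstep
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be tested without real waiting. Retries are safe only for idempotent operations, so that is an assumption the caller has to guarantee.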
I get the impression that cloud providers are particularly opaque about small glitches: they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and when you email support it's always the same response: "it's working right now" :)
I only have experience with AWS, on-prem, and high-quality colo like Equinix. Possibly due to reduced complexity and having full control over the networking setup, I have seen significantly fewer issues on-prem and in colo than on AWS.
Already playing with it, but FoundationDB is used for production petabyte-scale deployments, and the whole deterministic simulation approach to testing is really reassuring as far as bugs and stability go. I am guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
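For anyone unfamiliar with the idea, here is a toy sketch of deterministic simulation testing. It is not FoundationDB's actual framework, just an illustration under simple assumptions: every random decision, including injected faults, comes from one seeded PRNG, so a failing run can be replayed exactly from its seed. The `simulate` function and its lossy-link model are hypothetical.

```python
import random


def simulate(seed, steps=200, drop_rate=0.2):
    """Toy simulation: a client retries writes over a lossy link until they land.

    All randomness comes from one seeded PRNG, so the same seed always
    replays the same sequence of faults and writes.
    """
    rng = random.Random(seed)
    store = {}   # server-side state
    acked = {}   # writes the client believes are durable
    trace = []   # event log, used to compare runs
    for step in range(steps):
        key, value = rng.randrange(10), step
        while True:
            if rng.random() < drop_rate:   # injected fault: request lost
                trace.append(("drop", key, value))
                continue                   # client retries the same write
            store[key] = value
            trace.append(("write", key, value))
            break
        acked[key] = value
    # Invariant: everything the client considers durable matches the server
    assert store == acked, f"invariant violated for seed {seed}"
    return trace
```

In practice you would run something like this across many seeds; when an invariant fails, the seed in the error message is enough to reproduce the exact same run under a debugger.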
https://cdn.chrisshort.net/How-Complex-Systems-Fail.pdf