by asymptotic
1149 days ago
When I worked at AWS there was a similar scenario in eu-west-2. There was a fire in one of the availability zones (AZs). The fire suppression system kicked in and flooded the data center to ankle or knee height. All the racks were powered off and the building was evacuated for hours (I don't remember exactly how long) until the water was pumped out. The service team I worked for didn't have a great AZ-evacuation story at the time, and it took us tens of minutes to manually move out of the AZ, but at least there wasn't a customer-visible availability impact. Once we had moved out, it was just monitoring and babysitting until we got the word to move back in, I think 1-2 days later. If you operate on AWS, you work with the assumption that an AZ is a failure domain and can die at any time. A surprising number of service teams at AWS were, at the time, still operating services that didn't handle AZ failure that well. If you operate services in the cloud, you have to know what your failure domains are.
|
Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.