Hacker News
by asymptotic 1149 days ago
When I worked at AWS there was a similar scenario in eu-west-2. There was a fire in one of the availability zones (AZs). The fire suppression system kicked in and flooded the data center up to ankle or knee height. All the racks were powered off and the building was evacuated for hours (I don't remember the duration of the evacuation) until the water was pumped out.

On the service team I worked for, our AZ-evacuation story wasn't great at the time: it took us tens of minutes to manually move out of the AZ, but at least there was no customer-visible availability impact. Once we were out, it was just monitoring and babysitting until we got the word to move back in, which I think was 1-2 days later.

If you operate on AWS you work with the assumption that an AZ is a failure domain, and can die at any time. Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time). But if you operate services in the cloud you have to know what the failure domain is.
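
To put rough numbers on "an AZ can die at any time": if the surviving AZs have to carry the full load, each AZ needs enough headroom to cover its share of the failed one. A minimal sketch of that arithmetic, assuming load is spread evenly and only one AZ fails at a time (the function names and "capacity units" are made up for illustration, not anything AWS publishes):

```python
import math


def per_az_capacity(required_units, az_count):
    """Capacity to provision in each AZ so that losing any one AZ
    still leaves at least `required_units` in the remaining AZs.

    Assumes even load distribution and a single-AZ failure (hypothetical model).
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return math.ceil(required_units / (az_count - 1))


def total_provisioned(required_units, az_count):
    """Total capacity across all AZs under the same assumptions."""
    return per_az_capacity(required_units, az_count) * az_count


if __name__ == "__main__":
    for azs in (2, 3):
        print(azs, per_az_capacity(12, azs), total_provisioned(12, azs))
```

Under those assumptions, 12 units of required capacity means 12 per AZ across two AZs (24 total) but only 6 per AZ across three (18 total), which is one reason spreading over more AZs is cheaper than it first looks.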


> Surprisingly many service teams at AWS still operate services that don't handle AZ failure that well (at the time)

Ouch, hopefully none of the major services? I recently had to look into this for work (for disaster recovery preparation) and it seemed like ECS, Lambda, S3, DynamoDB and Aurora Serverless (and probably CloudWatch and IAM) all said they handled availability zone failures transparently enough.

I’m familiar with Lambda and DynamoDB. When I left in 2022 they both had strong automated or semi-automated AZ evacuation stories.

I’m not that familiar with S3, but I never noticed any concerns with S3 during an AZ outage. I’m not at all familiar with Aurora Serverless or ECS.

For all AWS services you can always ask AWS Support pointed, specific questions about availability. They usually defer to the service team and they’ll give you their perspective.

Also keep in mind that AWS teams differentiate between the availability of their control planes and data planes. During an AZ outage you may struggle to create or delete resources until the AZ evacuation is completed internally, but already-created resources should always meet the public SLA. That’s why, especially for DR, I recommend active-active or active-“pilot light”: have everything already created in all AZs/regions, so your DR plan doesn't need to create resources.
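
One way to keep yourself honest about "no creates in the DR plan" is to periodically diff what the plan needs against what already exists in each location. A rough sketch, assuming you can export a resource inventory per AZ/region from your own tooling (the resource names, location names, and `missing_resources` helper are all hypothetical):

```python
def missing_resources(required, existing_by_location):
    """Return {location: sorted list of required resources not yet created there}.

    `required` is the set of resource names the DR plan depends on.
    `existing_by_location` maps each AZ/region to the resources already created in it.
    Locations with nothing missing are left out of the result.
    """
    gaps = {}
    for location, existing in existing_by_location.items():
        missing = sorted(set(required) - set(existing))
        if missing:
            gaps[location] = missing
    return gaps


if __name__ == "__main__":
    required = {"orders-table", "orders-queue", "orders-function"}
    inventory = {
        "primary": {"orders-table", "orders-queue", "orders-function"},
        "standby": {"orders-table", "orders-queue"},
    }
    # Anything listed here would need a control-plane call during the outage.
    print(missing_resources(required, inventory))
```

An empty result means failover only depends on resources that already exist; anything it reports is a create you would be attempting while the control plane may be degraded.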

Okay good to know - Lambda seemed to suggest it could handle an availability zone going down without any trouble.

ECS Fargate's default is to distribute task instances across availability zones too, but I assume if you use EC2 it might not be as straightforward.
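
To show why spreading matters, here is a toy comparison of a spread placement against one packed into a single AZ. This is only an illustration using simple round-robin, not ECS's actual scheduler, and the AZ names are placeholders:

```python
from collections import Counter
from itertools import cycle


def spread_tasks(task_count, azs):
    """Assign tasks to AZs round-robin (illustrative only, not how ECS places tasks)."""
    az_cycle = cycle(azs)
    return [next(az_cycle) for _ in range(task_count)]


def surviving_tasks(placement, failed_az):
    """Count the tasks that are not in the failed AZ."""
    return sum(1 for az in placement if az != failed_az)


if __name__ == "__main__":
    azs = ["az-a", "az-b", "az-c"]   # placeholder AZ names
    spread = spread_tasks(6, azs)
    packed = ["az-a"] * 6            # everything landed in one AZ

    print(Counter(spread))
    print("spread, az-a fails:", surviving_tasks(spread, "az-a"))
    print("packed, az-a fails:", surviving_tasks(packed, "az-a"))
```

With six tasks over three AZs, losing one AZ leaves four running in the spread case and none in the packed case.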

And that makes sense - I remember during the last outage that affected me it was a compute rather than data failure and the running stuff continued fine, just nothing new was getting created.