by justinsb 4988 days ago
AZs were supposed to be that unit of isolation; then, when multiple AZs failed, the unit of isolation shifted to Regions. It seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because as one AZ went down everyone hammered the other AZs, taking out the APIs at least. That's when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.

2 comments

Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between Regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage: for example, when the US-EAST-1 API becomes unresponsive (due to throttling), the US-WEST-1 and US-WEST-2 APIs are still available.

Regions are 100% independent of one another, both physically and at the control-plane level. Also, code pushes of new features to regions never happen on the same day.

Source?

AZs were supposed to be independent; they aren't. Fool me once...

I used to work on the EC2 team. The regions are wholly independent of one another.

I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)