by justinsb 4988 days ago
You need to be in multiple regions to tolerate EC2 outages, not just multiple AZs. Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record. Though I can well understand that designing for EC2 region failure is not worth the cost for most systems.

> Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record.

Doesn't everything in their track record indicate that regions are nicely partitioned from each other? Even the biggest region failures they've had have stayed completely isolated to that region.

AZs were supposed to be that unit of isolation; then, when multiple AZs failed, the unit of isolation shifted to Regions. That seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because as one AZ went down everyone hammered the other AZs, taking out the APIs at least. That's when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.
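
One common way to soften the failover stampede described above is capped exponential backoff with random jitter on the client side. The following is only a minimal sketch under stated assumptions: `backoff_delays` and `call_with_backoff` are hypothetical names, the default timings are arbitrary, and nothing here reflects how AWS or any particular SDK actually retries.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one randomized delay in seconds per retry ("full jitter").

    The ceiling doubles on every attempt up to `cap`; the actual delay is
    a random fraction of that ceiling, so clients that failed together do
    not all retry at the same instant.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(call, attempts=5, sleep=time.sleep, **backoff_args):
    """Run `call`, retrying with jittered backoff; re-raise the last error.

    Catching every Exception is a simplification for this sketch; a real
    client would retry only on throttling or timeout errors.
    """
    delays = list(backoff_delays(attempts - 1, **backoff_args))
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
```

Passing a no-op `sleep` and a fixed `rng` makes the behavior easy to check without waiting or touching a network.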

Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, and other physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between Regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage. For example, when the US-EAST-1 API becomes unresponsive (due to throttling), the US-WEST-1/2 APIs are still available.
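
If one region's API can be throttled while another still answers, a client can try regions in priority order and use the first one that responds. This is a rough sketch only: `first_responsive_region` and the `probe` callable are hypothetical, and the probe is injected so that no real AWS call is assumed.

```python
from typing import Callable, Iterable, Optional


def first_responsive_region(
    regions: Iterable[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if none do.

    `probe` is a caller-supplied health check (hypothetical here); a probe
    that raises is treated the same as one that returns False.
    """
    for region in regions:
        try:
            if probe(region):
                return region
        except Exception:
            continue
    return None


# Simulated outage: the us-east-1 probe fails, the others succeed.
def fake_probe(region: str) -> bool:
    return region != "us-east-1"


chosen = first_responsive_region(["us-east-1", "us-west-1", "us-west-2"], fake_probe)
```

A sketch like this only helps if the regions really do fail independently, which is exactly the point under dispute in this thread.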

Regions are 100% independent of one another, both physically and in terms of the control plane. Also, code pushes of new features never go out to multiple regions on the same day.
Source?

AZs were supposed to be independent; they aren't. Fool me once...

I used to work on the EC2 team. The regions are wholly independent of one another.
I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)

There were comments during the failure that AWS wasn't properly switching over to the available zones during the outage. That's what I find troubling. You are paying extra for some guaranteed availability, and everyone keeps saying that's how you prevent downtime during outages. Then, when the time comes, it doesn't work?
If you were affected by this, I hope you got a big refund (1+ month).