| HN Mirror

	by justinsb 4988 days ago
	You need to be in multiple regions to tolerate EC2 outages, not just multiple AZs. Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record. Though I can well understand that designing for EC2 region failure is not worth the cost for most systems.

2 comments

ceejayoz 4988 days ago

> Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record.

Doesn't everything in their track record indicate that regions are nicely partitioned from each other? Even the biggest region failures they've had have stayed completely isolated to that region.

justinsb 4988 days ago

AZs were supposed to be that unit of isolation, then when multiple AZs failed that shifted to be Regions; it seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because as one AZ went down everyone hammered the other AZs, taking out the APIs at least. That's when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.
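One common way to soften that kind of failover stampede is for clients to retry with exponential backoff plus random jitter, so they do not all hit the surviving AZs or regions at the same instant. A minimal sketch, assuming a hypothetical helper `backoff_delays` with made-up defaults (nothing here is an AWS API or anything AWS documents):

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Return 'full jitter' retry delays in seconds.

    Delay i is drawn uniformly from [0, min(cap, base * 2**i)], so retries
    from many clients spread out instead of arriving in lockstep.
    All parameter values are illustrative assumptions.
    """
    rng = rng or random.Random()
    delays = []
    for i in range(attempts):
        ceiling = min(cap, base * (2 ** i))  # exponential growth, capped
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Each client would sleep for these durations between failover attempts.
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"attempt {attempt}: wait up to {delay:.2f}s before retrying")
```

This does not make regions any more isolated; it only reduces the extra load that failing-over clients put on whatever is still up.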

joeyi 4988 days ago

Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between Regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage. For example, when the US-EAST-1 API becomes unresponsive (due to throttling), the US-WEST-1/2 APIs are still available.
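If each region's API really does fail independently, a client can treat the regions as an ordered fallback list. A rough sketch, assuming a hypothetical `first_healthy_region` helper and a caller-supplied health probe (the probe is a stand-in, not a real AWS call):

```python
from typing import Callable, Iterable, Optional


def first_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if all fail.

    `is_healthy` is a hypothetical, caller-supplied check (for instance a
    cheap API call with a short timeout). A probe that raises is treated
    the same as an unhealthy region.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            continue  # unreachable or throttled: try the next region
    return None


if __name__ == "__main__":
    # Simulate US-EAST-1 being throttled while the western regions respond.
    def fake_probe(region: str) -> bool:
        return region != "us-east-1"

    print(first_healthy_region(["us-east-1", "us-west-1", "us-west-2"], fake_probe))
```

The ordering is a preference list only; it says nothing about whether data or capacity is actually available in the fallback region.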

res0nat0r 4988 days ago

Regions are 100% independent of one another, both physically and in terms of the control plane. Also, code pushes to regions for new features don't ever happen on the same day.

justinsb 4988 days ago

Source?

AZs were supposed to be independent; they aren't. Fool me once...

res0nat0r 4988 days ago

I used to work on the EC2 team. The regions are wholly independent of one another.

justinsb 4988 days ago

I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)

giulianob 4988 days ago

There were comments during the failure that AWS wasn't properly switching over to the available zones during the outage. That's what I find troubling. You are paying extra for some guaranteed availability, and everyone keeps saying that's how you prevent downtime during outages. Then when the time comes, it doesn't work?

justinsb 4988 days ago

If you were affected by this, I hope you got a big refund (1+ month).
