
by match 5103 days ago

Interesting question, let's try to compare the two events. The 2011 event involved roughly 13% of EBS volumes in the affected zone, plus multi-AZ control plane impact. The 2012 event involved 7% of EC2 instances in the affected zone. The two events are different in kind, since one was a power event and the other a network event, but let's see how they compare in number of impacted volumes. It's not exactly clear how to compare these numbers. 7% of instances were affected, but were any EBS servers affected? Possibly the same percentage, possibly none, possibly something in between. If you assume nearly all EC2 instances have at least a boot volume and possibly more attached, then maybe that's roughly 7-10+% of the volumes in the affected zone. Assuming a respectable growth rate (http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...), these events may have been around the same size in absolute terms (I'd be curious to hear other arguments for/against this guess).

If you compare the recovery time (ballpark, feel free to break down the timelines in your copious amounts of free time):

2011:

  12:47AM, Apr 21 - Event started, API impaired across all availability zones 

  12:00PM, Apr 21 - API recovered in non-affected zones
                        "Customers also experienced elevated error rates until Noon 
                         PDT on April 21st when attempting to launch new EBS-backed 
                         EC2 instances in Availability Zones other than the affected 
                         zone."

  12:30PM, Apr 22 - Nearly all volumes in affected zone restored
                        "all but about 2.2% of the volumes in the affected
                         Availability Zone were restored by 12:30PM PDT on 
                         April 22nd"

  6:15PM, Apr 23 - API restored for affected zone
                        "At 6:15 PM PDT on April 23rd, API access to EBS resources 
                         was restored in the affected Availability Zone."

2012:

  20:04, July 2 - Some number of racks lose power due to drained UPSs

  21:10, July 2 - API restored
                        "8:04pm PDT to 9:10pm PDT, customers were not able to launch
                         new EC2 instances, create EBS volumes, or attach volumes in
                         any Availability Zone in the US-East-1 Region. At 9:10pm PDT,
                         control plane functionality was restored for the Region."

  02:45, July 3 - Vast majority of volumes restored to customers
                        "By 2:45am PDT, 90% of outstanding volumes had been turned
                         over to customers."

http://aws.amazon.com/message/65648/ http://aws.amazon.com/message/67457/
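
For anyone who does want to break down the timelines, here is a minimal sketch that turns the timestamps above into elapsed time since the start of each event. The helper names (`at`, `elapsed`, `fmt`) and the milestone labels are mine, not anything from the AWS summaries, and each time is written as an offset from midnight on the day the event began so only the clock times quoted above are used.

```python
from datetime import timedelta


def at(day, hour, minute):
    """Offset from midnight of the day the event started (day 0)."""
    return timedelta(days=day, hours=hour, minutes=minute)


# Clock times copied from the timelines in this comment (all PDT).
# Labels are shorthand of my own, not AWS terminology.
EVENTS = {
    "2011": {
        "start": at(0, 0, 47),
        "milestones": [
            ("API recovered in non-affected zones", at(0, 12, 0)),
            ("nearly all volumes restored", at(1, 12, 30)),
            ("API restored in affected zone", at(2, 18, 15)),
        ],
    },
    "2012": {
        "start": at(0, 20, 4),
        "milestones": [
            ("API restored", at(0, 21, 10)),
            ("90% of outstanding volumes returned", at(1, 2, 45)),
        ],
    },
}


def elapsed(event):
    """Return (label, time since event start) for each milestone."""
    return [(label, t - event["start"]) for label, t in event["milestones"]]


def fmt(delta):
    """Format a timedelta as whole hours and minutes, e.g. '6h41m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    for year, event in EVENTS.items():
        print(year)
        for label, delta in elapsed(event):
            print(f"  {fmt(delta):>7}  {label}")
```

The milestones are not strictly like-for-like (the 2011 volume figure is "all but about 2.2%", the 2012 one is "90% of outstanding volumes"), so treat the output as a ballpark comparison only.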

Yes, I'm painting with broad strokes here, and feel free to argue the details (we always do). But I do think this at least shows some improvement, which answers the previous poster's question.

[edits to try to fix the formatting, fixed mis-paste]

1 comment

matt2000 5102 days ago

This is an awesome answer, thanks so much for doing the research here. So although the outage was much shorter this time, what concerns me is that they seem to have violated their "AZs are totally separate" claim again, which would point to a still-lurking fundamental problem. Happy to be corrected here if I'm wrong.
