by hexix 4985 days ago
I'm seeing a lot of this type of comment. The thing is, AWS completely crapped out. Don't believe their status updates that make it sound like it was a tiny little area of their data center. It was pretty much the entire zone, and whenever an outage affects an entire zone, it brings down global services and even other zones as well.

We had servers in the bad zone and started having load issues. When I went to use the cool cloud features that are made for this, the entire thing completely fell on its face. I couldn't launch new EC2 servers, either because the API was so bogged down or because the new zone I was launching in was restricted because of load.

Basically, the thing that nobody keeps in mind when they think it's so cool that you can spin up servers to work around outages is that EVERYONE IS DOING THAT. This is Amazon's entire selling point and when it comes to doing it, it doesn't work!

We were lucky to get some new servers launched before the API pretty much completely went down. They started giving everyone errors saying request limit exceeded. The forums were full of people asking about it.
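For anyone scripting launches against a throttled API, the usual client-side response to a "request limit exceeded" error is to retry with exponential backoff and jitter. Below is a minimal sketch of that idea, not anything AWS-specific: `RequestLimitExceeded` and `call_with_backoff` are hypothetical names made up for illustration, and `request` stands in for whatever launch call you are making. It will not help if the API is fully down, but it avoids adding to the stampede.

```python
import random
import time


class RequestLimitExceeded(Exception):
    """Hypothetical stand-in for a throttling error returned by a cloud API."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() and retry throttling errors with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Double the wait window on each failure, capped at max_delay,
            # then sleep for a random fraction of it so clients spread out.
            window = min(max_delay, base_delay * 2 ** attempt)
            sleep(window * rng())
```

The `sleep` and `rng` parameters exist only so the function can be exercised without real waiting or randomness.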

ELB, Elastic IP, and other services not associated with a single availability zone completely failed. I keep seeing comments saying that if people designed their stuff right, they wouldn't have an issue. That's just completely bull. AWS has serious design flaws, and they show up at every outage. It's NOT just people relying on a single zone.

2 comments

Totally agree. A lot of people don't know this, or substitute alternatives which are not necessarily viable. Among the tenets of reliability is isolation. The nature of Amazon's services is that they isolate at the datacenter level. One should isolate at the level at which one is comfortable taking on failures. Once there is an active dependency, a la EBS, the number of subsystems increases multi-fold and the likelihood of failure and cascading failure increases dramatically.

Where getting a bit from disk to memory used to be: platter -> diskcontroller -> cpu -> memory,

now with SANs & NFS & virtualized block storage, it's: platter -> diskcontroller -> cpu -> memory -> nic -> wire -> switch(es)/router(s)/network configs(human config item) -> wire -> nic -> cpu -> memory.

Not to say that centralized storage doesn't have its benefits, but the scope of isolation has grown drastically. Considering the combinatorial possibilities of failure in the former scenario versus the latter, the latter has a significantly larger chance of failure, with more failure modes, and failover is significantly more difficult to automate programmatically.
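To put rough numbers on the two paths above, here is a minimal sketch that treats every hop as an independent component with the same availability. Both the independence assumption and the 0.999 figure are invented purely for illustration; real components fail at different rates and often fail together, which makes the networked case worse than this suggests.

```python
def chain_availability(per_hop_availability, hops):
    """Availability of a path in which every hop must work.

    Assumes hops fail independently, so the probabilities multiply.
    """
    return per_hop_availability ** hops


# Hop lists taken from the two paths described in the comment.
LOCAL_PATH = ["platter", "diskcontroller", "cpu", "memory"]
NETWORKED_PATH = LOCAL_PATH + [
    "nic", "wire", "switches/routers/network configs", "wire",
    "nic", "cpu", "memory",
]

P = 0.999  # assumed per-hop availability, illustrative only

local = chain_availability(P, len(LOCAL_PATH))
networked = chain_availability(P, len(NETWORKED_PATH))

print(f"local path:     {len(LOCAL_PATH)} hops, P(failure) = {1 - local:.4%}")
print(f"networked path: {len(NETWORKED_PATH)} hops, P(failure) = {1 - networked:.4%}")
```

For small per-hop failure rates the chance of failure grows roughly in proportion to the hop count, so going from 4 hops to 11 comes close to tripling it before any correlated failures or human config errors are counted.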

TLDR: With Amazon, the scope of isolation is the datacenter. To be on Amazon, one must architect and design for handling failure at the datacenter level, rather than at the host or cluster level.

We didn't have downtime, for various reasons, but the ELBs we were using failed, and the queue of starting instances was too big for us to see our few instances restarting.

The main systemic issue in EC2 is EBS; take that away and you almost completely remove downtime.

The problem with ELBs is that they are themselves EC2 instances and use many of the global services for detecting load, scaling up, etc. Like all of AWS's value-added services, they are therefore more likely to fail during an outage event, not less likely, as they depend on more services.