by hexix
4985 days ago
I'm seeing a lot of this type of comment. The thing is, AWS completely crapped out. Don't believe their status updates that make it sound like it was a tiny little area of their data center. It was pretty much the entire zone, and whenever an outage affects an entire zone, it brings down global services and even other zones as well.

We had servers in the bad zone and started having load issues. When I went to use the cool cloud features that are made for this, the entire thing completely fell on its face. I couldn't launch new EC2 servers, either because the API was so bogged down or because the new zone I was launching in was restricted due to load. Basically, the thing that nobody keeps in mind when they think it's so cool that you can spin up servers to work around outages is that EVERYONE IS DOING THAT. This is Amazon's entire selling point, and when it comes to actually doing it, it doesn't work!

We were lucky to get some new servers launched before the API pretty much completely went down. They started giving everyone errors saying the request limit was exceeded. The forums were full of people asking about it. ELB, Elastic IP, and other services not associated with a single availability zone completely failed.

I keep seeing comments saying that if people designed their stuff right, they wouldn't have an issue. That's just completely bull. AWS has serious design flaws and they show up at every outage. It's NOT just people relying on a single zone.
|
Where getting a bit from disk to memory used to be: platter -> disk controller -> cpu -> memory,
now with SANs, NFS, and virtualized block storage, it's: platter -> disk controller -> cpu -> memory -> nic -> wire -> switch(es)/router(s)/network configs (a human-configured item) -> wire -> nic -> cpu -> memory.
Not to say that centralized storage doesn't have its benefits, but the scope of isolation has drastically increased. Considering the combinatorial possibilities of failure in the former scenario versus the latter, the latter has significantly more ways to fail and a larger chance of failing, and failover for it is significantly more difficult to automate programmatically.
TLDR: With Amazon, the scope of isolation is the datacenter. To be on Amazon, one must architect and design to handle failure at the datacenter level, rather than at the host or cluster level.
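
To put rough numbers on the two data paths above, here is a minimal back-of-the-envelope sketch. It assumes every hop fails independently and has the same availability, which is a simplification (real failures are often correlated, which makes things worse, not better). The 0.999 figure and the function names are hypothetical, chosen only for illustration.

```python
from math import prod

# Hypothetical per-component availability. Illustrative only, not a measured figure.
P = 0.999

# The two paths as listed in the comment above.
LOCAL_PATH = ["platter", "disk controller", "cpu", "memory"]
NETWORK_PATH = [
    "platter", "disk controller", "cpu", "memory",
    "nic", "wire", "switches/routers/network config",
    "wire", "nic", "cpu", "memory",
]

def path_availability(components, p=P):
    """Probability that every hop in a serial chain works,
    assuming independent failures with identical availability p."""
    return prod(p for _ in components)

def failure_combinations(components):
    """Number of distinct non-empty sets of hops that could be down at once."""
    return 2 ** len(components) - 1

if __name__ == "__main__":
    for name, path in (("local", LOCAL_PATH), ("networked", NETWORK_PATH)):
        print(name, len(path), "hops,",
              failure_combinations(path), "failure combinations,",
              f"availability {path_availability(path):.6f}")
```

Going from 4 hops to 11 takes the number of possible failure combinations from 15 to 2047, and the chain's availability can only go down as hops are added. That is the combinatorial growth that makes automated failover so much harder to get right.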