| You can plan all you'd like, failures happen not necessarily due to poor planning but because in real life, shit happens. Pokemon Go for instance experienced like 50x the amount of traffic they planned for. Secondly, software companies like Microsoft, Google and IBM might know a thing or two about running data centers. Due to economies of scale, these companies are inherently in a better position to supply hardware at scale. > If entire region is down do you think other regions can handle the load. If you think so you're kidding yourself Netflix routinely does just this to test the resilience of their systems. They pick a random AWS region, and they evacuate it. All the traffic is proxied to the other regions and eventually via DNS the traffic is routed entirely to the surviving regions. No interruption of service is experienced by the users. Here's a visualization of Netflix simulating a failure on the US-east-1 region and failing over to US-west-1/US-west-2 https://www.youtube.com/watch?v=KVbTjlZ0sfE The top right node is the one that fails. As the error rate climbs, traffic starts getting proxied over to the surviving nodes, until a DNS switch redirects all traffic to the surviving nodes. Netflix does this monthly, in production. They also run https://github.com/Netflix/SimianArmy on production. The cloud enables fault tolerance, resiliency and graceful degradation. |