
by nijave 1429 days ago

As some others have alluded to, it seems the common AWS services you rely on to manage multi-AZ traffic (like ALBs and Route53) spike in error rate and slow to a crawl, so it becomes difficult to fail things over. On top of that, services like RDS that run an active hot standby themselves rely on those same services to fail over, so it's difficult to get the DB to actually fail over.

I suspect that, behind the scenes, AWS fails to absorb the massive influx of requests and network traffic as load shifts between AZs.

I would think regions with more AZs (like us-east-1) would handle an AZ failure better, since there are more AZs to spread the load across.
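To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes traffic is spread evenly across AZs and that a failed AZ's share is redistributed evenly to the survivors, which is a simplification and not how AWS necessarily behaves. The function name and the AZ counts are made up for illustration.

```python
def surviving_az_load_factor(num_azs: int) -> float:
    """Load multiplier on each surviving AZ after one AZ fails.

    Assumes (hypothetically) an even spread of traffic before the failure
    and an even redistribution afterwards.
    """
    if num_azs < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return num_azs / (num_azs - 1)


# Illustrative AZ counts, not a statement about any specific region.
for n in (2, 3, 6):
    print(f"{n} AZs -> each survivor carries {surviving_az_load_factor(n):.2f}x its normal load")
```

Under those assumptions, losing one of two AZs doubles the load on the survivor, while losing one of six adds only about 20% to each remaining AZ.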

What's more surprising, imo, is that large apps like New Relic and Zoom, which you'd expect to be resilient (multi-region/multi-cloud), took a hit.