by throwaway2016a 1429 days ago
Architect here. We had an outage even though we have a very complete architecture. The issue was that the services were still reachable via internal health checks, so instead of the affected servers being taken out of service, they stayed in.

We had to resolve it by manually shutting down all the servers in the affected AZ, which is normally not needed.

There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps too.

The only reason we knew to shut them down at all was that AWS told us the exact AZ in their status update. We were beginning the process of pinging each server individually to try to find the bad ones (because again, all the health checks were fine).

3 comments

Yup, exact same here. All of the multi-AZ failover depends on AWS recognizing that their AZ is having an issue, and no health check ever reported an issue, so no failover ever happened. We started being able to make progress when AWS told us which AZ was having issues. It still took some time for us to manually shift away from that AZ (manually promoting ElastiCache replicas to primary, switching RDS clusters around, etc.) because the AWS failover functionality we were relying on did not work as it should have. Multi-region failover would have made us more fault tolerant, but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region). Here's to hoping we never have a Route53 or global AWS API Gateway failure! Then even multi-region will not do us much good. Perhaps we should have some backup servers on the moon, then in case of nuclear warfare we can still be online via satellite.

P.S. AWS has been saying for almost 2 hours now that the issue is resolved, and we are still having issues with us-east-2a.

Which internal health checks are you referring to?
Both the EC2 instance health and our HTTP health checks. If either of those failed the server would have been removed from the load balancer, but they didn't fail.

Only the external health checks that hit the system from an outside service were failing. And because those spread the load across the AZs, only a fraction of them were failing and there was no good way to tell the pattern of failure.
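
For what it's worth, here is a rough sketch of how the pattern could be surfaced, assuming the external probes can be pointed at individual instances (not just the load balancer) so that each result can be tagged with the AZ of the server that answered. The function names, the tuple format, and the 50% threshold are all hypothetical, not something from our actual setup.

```python
from collections import defaultdict


def failure_rate_by_az(probe_results):
    """Return {az: fraction of failed probes}.

    probe_results is an iterable of (az, ok) tuples, one per external
    probe, where ok is True if the probe succeeded. This format is an
    assumption made for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for az, ok in probe_results:
        totals[az] += 1
        if not ok:
            failures[az] += 1
    return {az: failures[az] / totals[az] for az in totals}


def suspect_azs(probe_results, threshold=0.5):
    """Return the AZs whose failure rate is at or above the threshold."""
    rates = failure_rate_by_az(probe_results)
    return sorted(az for az, rate in rates.items() if rate >= threshold)


if __name__ == "__main__":
    # Made-up sample data: most probes into one AZ fail, the rest pass.
    sample = [
        ("us-east-2a", False), ("us-east-2a", False), ("us-east-2a", True),
        ("us-east-2b", True), ("us-east-2b", True),
        ("us-east-2c", True),
    ]
    print(suspect_azs(sample))
```

Grouping by AZ is the whole point: looked at in aggregate through the load balancer, the same data is just "some fraction of requests fail".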

I did have some Kubernetes pods become unhealthy but only because they relied on making calls to servers that were in a different AZ.

that tracks with our experience as well