by h02 1418 days ago
Yup, exact same here. All of the multi-AZ failover depends on AWS recognizing that their AZ is having an issue, and they never reported an issue on any health check, so no failover ever happened. We only started making progress once AWS told us which AZ was having issues. It still took some time for us to manually shift away from that AZ (manually promoting ElastiCache replicas to primary, switching RDS clusters around, etc.) because the AWS failover functionality we were relying on did not work as it should have. Multi-region failover would have made us more fault tolerant, but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region). Here's to hoping we never have a Route53 or global AWS API Gateway failure! Then even multi-region would not do us much good. Perhaps we should have some backup servers on the moon, then in case of nuclear warfare we can still be online via satellite.

P.S. AWS has been saying the issue is resolved for almost 2 hours now, and we are still having issues with us-east-2a.