by throwaway2016a
1429 days ago
Architect here. We had an outage and we have a very complete architecture. The issue is that the services were still reachable via internal health checks, so instead of the affected servers being taken out of service, they stayed in. We had to resolve it by manually shutting down all the servers in the affected AZ, which is normally not needed. There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps too. The only reason we knew to shut them down at all was that AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because again, all the health checks were fine).
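For anyone stuck doing the same per-server pinging, here is a rough sketch of how it could be automated: probe each server from outside the affected network path and group the failures by AZ. The hostnames, the port, the 50% threshold, and the function names are all made up for illustration. This is not what we actually ran.

```python
import socket
from collections import defaultdict


def tcp_reachable(host, port=443, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 443 and the 2-second timeout are assumptions, not recommendations.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def suspect_zones(servers, probe=tcp_reachable, threshold=0.5):
    """Return the AZs where at least `threshold` of the servers fail the probe.

    `servers` is an iterable of (host, az) pairs. `probe` is any callable
    that takes a host and returns True when the host is reachable.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for host, az in servers:
        totals[az] += 1
        if not probe(host):
            failures[az] += 1
    return sorted(az for az in totals if failures[az] / totals[az] >= threshold)
```

The probe only tells you something if it runs from a vantage point that crosses the broken path (another AZ, another region, or outside AWS entirely). Run from inside the same AZ it would likely pass, just like the internal health checks did.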
P.S. AWS has been saying for almost 2 hours now that the issue is resolved, and we are still having issues with us-east-2a.