by cle 1964 days ago

> During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Sounds like AWS knew how to handle it too.

Given how AWS has responded to past events like this, I'd bet there's an internal post-mortem and they'll add mechanisms to fix this scaling bottleneck for everyone.

One thing I'm not clear on, though, is whether this was really an AWS issue or whether Slack hit one of the documented limits of Transit Gateway (such as bandwidth), after which AWS started dropping packets. If that's the case, then I don't see what AWS could have done here, other than perhaps offer ways to monitor those limits, if they don't already. The post is a bit fuzzy on these details.