by alluro2
1438 days ago
I don't mean anything bad toward Deno's team (I'm very partial to what they're building), but I'm rather surprised whenever a widely publicized service has an outage that lasts hours, or even more than 24h. I'm genuinely curious whether that's typically due to the complexity of the infrastructure and how hard it is to find root causes, to how long it takes to redirect traffic or patch temporarily once the cause is found, or to an attitude where it's considered normal for these things to happen and to take time to solve step by step. Our services are of what I consider medium complexity (~70 services, ~10 different "layers" of logic, db, caching, load balancing etc., AWS, mostly self-managed centralized logging and monitoring) but still quite low-volume (< 100 requests / second), and any more serious issue (let alone an outage) is spontaneously treated by my team as an absolute emergency and typically fixed in < 10 minutes. We're very modestly funded compared to Deno (in this example) and the team is small... I'm not sure whether that changes with traffic volume, complexity, or team size, or whether it's primarily attitude-based and should continue to be cultivated.
After we managed to isolate the issue, we were able to disable the region within 30 minutes, because we had an established protocol for doing that.
Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem
Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.
All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.