by lucacasonato
1438 days ago
Our issue here was very much in finding the root cause. Because the failed traffic was “black holed” (TCP connections were being dropped), we had very little information other than “it isn’t working” from the users that reported the issue. This caused us significant headaches in trying to figure out what the incident reports from our users had in common (the geo region). Until that was clear, we were also checking database clusters, DNS configurations, TLS certificates, etc. to try to isolate the issue.

Once we had isolated the issue, we were able to disable the region within 30 minutes, because we had an established protocol for how to do that. Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem

Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.

All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.
As mentioned, I'm looking forward to continuing to follow Deno's progress, and all the best in hardening your DevOps!