by lucacasonato 1438 days ago

Our issue here was very much in finding the root cause. Because the failed traffic was “black holed” (TCP connections were being dropped), we had very little information other than “it isn’t working” from the users who reported the issue. This caused us significant headaches in trying to figure out what our users' incident reports had in common (the geo region). Until that became clear, we were also checking database clusters, DNS configurations, TLS certificates, etc. to try to isolate the issue.

Once we had isolated the issue, we were able to disable the region within 30 minutes, because we had an established protocol for doing that.

Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem

Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.

All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.

1 comment

alluro2 1438 days ago

Thanks for the insight - I definitely wasn't trying to dump on the team or its handling of the issue - I'm really just trying to understand better so I have more awareness and can hopefully help my team (as a young CTO) be better prepared for different types of challenges.

As mentioned, I'm looking forward to continuing to follow Deno's progress, and all the best in hardening your devops!
