
by alluro2 1438 days ago

I don't mean anything bad to Deno's team (I'm very partial to what they're building), but I'm rather surprised whenever a widely-publicized service has an outage that lasts hours, or even more than 24h. I'm genuinely curious to understand whether it's typically due to the complexity of the infrastructure and how hard it is to find root causes, how long it takes to redirect traffic / patch temporarily once the cause is found, or whether it's due to an attitude where it's considered normal for these things to happen and to take time to solve step by step.

Our services are of what I consider medium complexity (~70 services, ~10 different "layers" of logic, db, caching, load balancing etc, AWS, mostly self-managed centralized logging and monitoring) but still quite low-volume (< 100 requests / second), and any more serious issue (let alone outage) is spontaneously met by my team as absolute emergency and typically fixed in < 10 minutes.

We're very modestly funded compared to Deno (in this example) and the team is small...

Not sure whether that changes with traffic volume, complexity, or team size, or whether it's primarily attitude-based and should continue to be cultivated.

3 comments

lucacasonato 1438 days ago

Our issue here was very much in finding the root cause. Because the failed traffic was “black holed” (TCP connections were being dropped), we had very little information other than “it isn’t working” from the users that reported the issue. This caused us significant headaches in trying to figure out what the commonality between the incident reports of our users was (the geo region). Up until the point this was clear, we were also checking database clusters, DNS configurations, TLS certificates etc to try to isolate the issue.

After we managed to successfully isolate the issue we were able to disable the region within 30 minutes, because we had an established protocol for how to do that.

Here is a more typical incident update for us: https://deno.com/blog/2022-05-30-outage-post-mortem

Part of the issue was also that we did not realize the scope of the issue right at the start of the incident, because our automated monitoring did not catch the dropped traffic.
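To illustrate the kind of check that could surface black-holed traffic, here is a minimal Python sketch of an outside-in probe. It is not Deno's monitoring, just an assumption-laden example: the function names (`probe`, `failure_rate_by_region`, `suspect_regions`), the outcome labels, and the threshold are all hypothetical. The idea is that a dropped TCP connection shows up as a connect timeout rather than an error response, so the probe records timeouts explicitly and groups results by the region the probe ran from.

```python
import socket
from collections import defaultdict


def probe(host, port, timeout=3.0, connect=socket.create_connection):
    """Classify one TCP connect attempt as 'ok', 'timeout', 'refused' or 'error'.

    `connect` is injectable so the classification can be tested offline.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except (socket.timeout, TimeoutError):
        # No answer at all: what a black-holed connection looks like from outside.
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
    conn.close()
    return "ok"


def failure_rate_by_region(samples):
    """samples: iterable of (region, outcome) pairs -> {region: failure rate}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, outcome in samples:
        totals[region] += 1
        if outcome != "ok":
            failures[region] += 1
    return {region: failures[region] / totals[region] for region in totals}


def suspect_regions(samples, threshold=0.5):
    """Regions whose failure rate is at or above the (hypothetical) threshold."""
    rates = failure_rate_by_region(samples)
    return sorted(region for region, rate in rates.items() if rate >= threshold)
```

Run from probes located in several regions, something like `suspect_regions(samples)` would point at the geographic commonality directly instead of relying on user reports.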

All that is to say: the outage is obviously unacceptable, and we sincerely apologize for it. We are working very hard to make sure nothing similar can occur again in the future.


alluro2 1438 days ago

Thanks for the insight - I definitely wasn't trying to dump on the team or the handling of the issue - I really just want to understand this better so I have more awareness and can hopefully help my team (as a young CTO) be more prepared for different types of challenges.

As mentioned, I'm looking forward to continuing to follow Deno's progress and all the best in hardening your devops!


viraptor 1438 days ago

> is spontaneously met by my team as absolute emergency and typically fixed in < 10 minutes

Unless they had "route the whole region over to another one" in their prepared and practiced DR procedure, it would take any team significant time to get that planned, approved, implemented and tested.

If you're running something at tens of services scale and recovered in 10min, you're extremely lucky. I'd suggest that if you don't have risks on your list that will take hours to resolve, your list is not complete.
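As a rough illustration of what "prepared and practiced" could mean for the region-rerouting step, here is a small Python sketch that computes where each region's traffic would go when some regions are disabled. Everything in it is hypothetical (the `effective_routes` name, the fallback mapping, the region labels); it only shows that the plan can be written down and dry-run ahead of time, not how any real provider does it.

```python
def effective_routes(regions, disabled, fallback):
    """Return {region: region that should serve its traffic}.

    regions:  iterable of region names (hypothetical labels)
    disabled: set of regions currently taken out of service
    fallback: {region: next region to try if this one is disabled}
    Raises RuntimeError when a disabled region has no healthy fallback.
    """
    routes = {}
    for region in regions:
        target = region
        seen = set()
        # Follow the fallback chain until a healthy region is found.
        while target in disabled:
            if target in seen or target not in fallback:
                raise RuntimeError(f"no healthy fallback for {region}")
            seen.add(target)
            target = fallback[target]
        routes[region] = target
    return routes
```

Calling it with a disabled set that covers every region raises immediately, which is exactly the kind of gap a dry run is meant to find before an incident.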


alluro2 1438 days ago

That's a fair point and a good suggestion to consider.

One mitigating circumstance is that, since we run on AWS, a big portion of such issues (ones that would take a long time to resolve) would come from wider AWS outages, where there's significant leeway: as the old adage goes, customers and a big part of the web would have bigger problems than us being down if an entire AWS region (or several) were down.

In Deno's case, most of "those" parts are self-managed and surely much more difficult to keep running reliably.


jdlshore 1438 days ago

I’m always curious to learn about why people create complex architectures. It’s off-topic, but why so much complexity for such a low volume?


donavanm 1437 days ago

A lot of this can be from (un)intentionally trying to maintain separation of responsibilities between different teams or developers. Decoupling, interfaces, etc all add up and pretty soon you start building based on what's already done vs where you originally intended to go. And I don't think that's a poor choice; nine women can't have a baby in one month, but they can have nine in nine months (to butcher an old saying).
