by viraptor
1438 days ago
> is spontaneously met by my team as absolute emergency and typically fixed in < 10 minutes

Unless they had "route the whole region over to another one" in their prepared and practiced DR procedure, it would take any team significant time to get that planned, approved, implemented and tested. If you're running something at the scale of tens of services and recovered in 10 minutes, you're extremely lucky. I'd suggest that if your list doesn't include risks that will take hours to resolve, your list is not complete.
|
One mitigating circumstance is that, since we run on AWS, a large share of such issues (the ones that would take a long time to resolve) would come from wider AWS outages, and in those cases there's significant leeway: as the old adage goes, if an entire AWS region (or several) is down, customers and a big part of the web have bigger problems than us being down.
In Deno's case, most of "those" parts are self-managed and surely much harder to keep running reliably.