Hacker News
by 7a1c9427 1202 days ago
> Google's automation systems mitigated this failure by pushing a complete topology snapshot during the next programming cycle. The proper sites were restored and the network converged by 05:05 US/Pacific.

I think this is the most understated part of the whole report. The bad thing happened due to "automated clever thing" and then the system "automagically" mitigated it in ~7 minutes. Likely before a human had even figured out what had gone wrong.

How would you otherwise do it? Anything that automatically pushes updates should monitor for a rapid increase in errors afterwards and roll back if one occurs. You should do at least that if you are working on a critical system.
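
For concreteness, here is a minimal sketch of the "push, watch errors, roll back" loop described above. It is not how Google's system works (the report doesn't say); every name here (`push_with_rollback`, `apply_update`, `roll_back`, `sample_error_rate`) and every threshold is a hypothetical stand-in for whatever deploy and metrics hooks a real system would have.

```python
import time
from typing import Callable


def push_with_rollback(
    apply_update: Callable[[], None],       # hypothetical: pushes the new config
    roll_back: Callable[[], None],          # hypothetical: restores the old config
    sample_error_rate: Callable[[], float], # hypothetical: current error fraction
    baseline: float,                        # error rate measured before the push
    max_ratio: float = 2.0,                 # assumed threshold: 2x the baseline
    min_abs_increase: float = 0.01,         # assumed floor to ignore tiny blips
    samples: int = 5,
    interval_s: float = 0.0,                # wait between samples; 0 for demos
) -> bool:
    """Apply an update, watch the error rate, and roll back on a spike.

    Returns True if the update was kept, False if it was rolled back.
    """
    apply_update()
    for _ in range(samples):
        time.sleep(interval_s)
        rate = sample_error_rate()
        # Require both a relative and an absolute jump, so a near-zero
        # baseline does not trigger a rollback on noise alone.
        if rate > baseline * max_ratio and rate - baseline > min_abs_increase:
            roll_back()
            return False
    return True
```

The hard part in practice is everything this sketch assumes away: a trustworthy error signal, a rollback that is itself safe, and thresholds that don't flap.
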
Sure, in an ideal world this is how nearly everything would work.

Getting a complex system to a level of maturity where this is feasible to do at scale in real life and actually work well is a respectable and non-trivial achievement.

I don't know if Amazon or Azure are able to confidently and effectively put in such automatic remediation measures globally. My sense is there are humans involved to triage and fix unusual types of outages at every other cloud provider, including the other bigs.

Leaving a comment on a message board saying how things ought to work is one thing (there's nothing wrong with your comment, I like it!); actually building it is another. I only want to highlight, bold, and underscore that successfully achieving this level of automatic remediation atop a large and dynamic system is uncommon and noteworthy.

Based on the public RCAs of outages that occurred at Azure, it seems that their recovery processes are largely manual, or manually triggered.