That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.
Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.
> Those are not mutually exclusive and actually a good idea.
The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.
The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.
> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.
This was a multi-regional failure, so this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again, but you shouldn’t assure people that it won’t.
Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.
They realized the issue within an hour. Unfortunately, since they took down multiple replicas of their network control plane, they lost the Paxos primary and had to rebuild configuration.
(Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)
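To make that reading concrete, here's a rough Python sketch of the failure mode as I understand it. It is a toy model, not Google's actual system: the names (`Speaker`, `surviving`, `routes_valid`, `lc-1`, `zone-1`) and the 600-second fail-static window are all made up for illustration. The point is only that spreading replicas across zones doesn't help if the thing doing maintenance operates on a different grouping that contains all of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """A hypothetical BGP speaker replica."""
    name: str
    zone: str             # physical placement (what we thought gave redundancy)
    logical_cluster: str  # grouping the maintenance engine acts on (assumption)


def surviving(speakers, descheduled_clusters):
    """Return the speakers whose logical cluster was not descheduled."""
    return [s for s in speakers if s.logical_cluster not in descheduled_clusters]


def routes_valid(seconds_since_control_plane_loss, fail_static_seconds):
    """Fail static: the data plane keeps its last known routes until they expire."""
    return seconds_since_control_plane_loss < fail_static_seconds


# Three replicas in three zones, but all parented by the same logical clusters.
shared_parent = [
    Speaker("bgp-a", "zone-1", "lc-1"),
    Speaker("bgp-b", "zone-2", "lc-1"),
    Speaker("bgp-c", "zone-3", "lc-2"),
]

# Same zones, but one replica lives in a cluster the maintenance event skips.
independent_parent = [
    Speaker("bgp-a", "zone-1", "lc-1"),
    Speaker("bgp-b", "zone-2", "lc-2"),
    Speaker("bgp-c", "zone-3", "lc-3"),
]

maintenance_event = {"lc-1", "lc-2"}  # clusters descheduled together

FAIL_STATIC_SECONDS = 600  # invented value, purely illustrative

if __name__ == "__main__":
    print("shared parent survivors:", surviving(shared_parent, maintenance_event))
    print("independent survivors:", surviving(independent_parent, maintenance_event))
    for elapsed in (60, 900):
        print(elapsed, "s after loss, routes valid:",
              routes_valid(elapsed, FAIL_STATIC_SECONDS))
```

In the shared-parent layout nothing survives the maintenance event even though the replicas sit in three different zones, and once the fail-static window runs out the routes are gone. In the second layout one replica keeps running, which is the kind of isolation the zone spread was presumably meant to provide.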
A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.
Disclaimer: Ops/infra engineer in a previous life.
It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It's possible that the same DNS provider's outage was also causing problems for GCE and AWS at the same time.
That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)