> AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.
AWS has similarly suffered outages from an entire datacenter being taken out like this. No one is immune. If you want true fault-tolerance you need to be multi-regional (everyone says as much), ideally, multi-continental.
europe-west9 is the only large Google datacenter in France afaik. Building more would cost lots more money, and it seems like the market isn't there for it. Workloads that require data locality in France are presumably suffering the most. And there are knock-on effects on other datacenters from losing an entire huge chunk of capacity like this.
Eh, source for that? AWS has had issues where a single zone caused such a lack of capacity in the region that some multi-zone services degraded to the point of a domino fail-over. However, I've not heard of any AWS event where a fire/flood in AZ A also caused a fire/flood in AZ B.
But does it really matter whether the incident is a flood or a cascading software failure if the likelihood and severity are the same?
Being in the same building is an "implementation detail" from a customer perspective; what matters is the consequences of this decision.
For example, maybe this decision allows for better network connectivity at a lower cost for inter-zone traffic while, on the other hand, not protecting against some classes of risks.
In the end, you can have a similar multi-zone outage keeping the region down for an extended period of time just because of a bad network config push (see the massive Facebook outage in 2021). As a customer, I don't care if it's a flood or a network outage.
Imho, what matters the most is a clear documentation of how these abstractions work for users and the corresponding contractual agreements (costs, SLAs, etc). Users can thus decide if they are ready to pay the price of protecting themselves against an extended outage impacting a single region.
The MTTR for outages caused by physical damage is way higher, and resiliency against physical disasters is a major selling point of availability zones as a fault container.
Hosting every zone of your region (if that's actually the case here) in the same building is simply negligent.
Besides the obvious risks like this incident, even if the zones have physical fire barriers, the chances that operators will be allowed into one "zone" after another has a fire are slim to none.
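To put rough numbers on the MTTR point, here is a minimal sketch using the standard steady-state formula availability = MTBF / (MTBF + MTTR). The failure frequency and both repair times are made-up illustrative values, not figures from any provider.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical inputs: both failure classes occur once every five years.
MTBF_HOURS = 5 * HOURS_PER_YEAR
SOFTWARE_MTTR_HOURS = 4          # assumed: bad config push, rolled back
PHYSICAL_MTTR_HOURS = 14 * 24    # assumed: fire/flood, two weeks to rebuild

software_downtime = annual_downtime_hours(MTBF_HOURS, SOFTWARE_MTTR_HOURS)
physical_downtime = annual_downtime_hours(MTBF_HOURS, PHYSICAL_MTTR_HOURS)

print(f"software failure: {software_downtime:.2f} h/year down")
print(f"physical failure: {physical_downtime:.2f} h/year down")
```

With the same likelihood, the expected downtime scales almost linearly with MTTR, which is why the two failure classes are not interchangeable even when they are equally rare.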
True, I implicitly included the MTTR in the "severity", but this is actually a different thing (severity is more about the impact radius).
But I don't think it changes my point: knowing what/how Google Cloud designs regions or zones is still an implementation detail; what matters is what MTTR they are targeting, and this should be known ahead of time.
There are so many "implementation details" that customers are not aware of, because they are always changing, non-contractual, or just hard to make sense of. What matters is meaningful abstractions.
I am not saying whether it's OK for the zones to be in the same building; I don't know, and I was really surprised when I discovered this a few years ago. But this information gives you a mental model of "what could go wrong" that is biased towards some specific risks, and in my experience, relying on these very practical aspects makes the risk analysis and design decisions harder.
OTOH, one thing that may be problematic (and biasing) too is that the commonly understood definition of a "zone" is the one people know from AWS, so using the same term without being very explicit about the differences will also lead to incorrectly calculated risks. I find the public documentation of Google Cloud too vague in general (and often ambiguous).
It sounds like this might just be confusion over nomenclature, with Google and Amazon using different terms for the same thing.
Regardless, with GCP, if you need redundancy that can survive the loss of an entire datacenter, then you need to be multi-regional. This has been a widely known best practice for a long time.
Because power / network / software maintenance events cause outages. Those are scheduled per zone, and so they will take down one zone but not a whole data center.
Minimising the blast radius from logical changes (software & config) that get rolled out at an AZ-level.
Their descriptions[0] however promise zones have a "high degree of independence from one another in terms of physical and logical infrastructure". Just how well separated this physical zonal infrastructure was remains to be seen ...
Yeah I feel like that description is a lie. Some customers would probably think twice about putting things into the same region if they knew zones weren't physically separated, or go to AWS.
Ouch. Isn't part of the point of separate zones being protected against something like, say, a terrorist attack or a natural disaster that can take down a whole datacenter?
> Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources provided in one or more physical data centers.
> (...)
> A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.
You should use "region" and "zone" as abstract concepts with shared properties like network topology, local peering, costs, and availability. AFAIK no cloud provider discusses, or provides guarantees against, specific threats or correlated failures.
There is no guarantee that a given risk will not impact multiple zones, but this risk is lowered by the implementation of various safeguards (for example, rollouts are not happening in multiple regions at the same time).
Google doesn't say "put your VMs in more than one zone because you can be sure we won't have all zones in a region down at the same time", but rather "by putting your VMs in multiple zones in the same region, you can target better SLOs than the SLOs in one zone".
Note that it's different from the concept of "availability zone" of AWS which explicitly says that AZs are physically separated:
> AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.
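The "better SLOs with multiple zones, but no guarantee against correlated failures" argument can be sketched as a back-of-the-envelope calculation. This is a toy model under loud assumptions: zones fail independently of each other except for one shared region-wide failure mode, and every number below is hypothetical rather than taken from any provider's SLA.

```python
def multi_zone_availability(
    zone_availability: float,
    zones: int,
    shared_failure_prob: float = 0.0,
) -> float:
    """Availability of a service that is up if at least one zone is up.

    zone_availability: probability a single zone is up (assumed equal for all zones).
    zones: number of zones the service is deployed in.
    shared_failure_prob: probability of a correlated event (same building,
        bad region-wide rollout, ...) that takes every zone down at once.
    """
    all_zones_down_independently = (1 - zone_availability) ** zones
    return (1 - shared_failure_prob) * (1 - all_zones_down_independently)


# Hypothetical figures for illustration only.
single = multi_zone_availability(0.99, zones=1)
three_independent = multi_zone_availability(0.99, zones=3)
three_correlated = multi_zone_availability(0.99, zones=3, shared_failure_prob=0.001)

print(f"1 zone:                     {single:.6f}")
print(f"3 zones, fully independent: {three_independent:.6f}")
print(f"3 zones, 0.1% shared risk:  {three_correlated:.6f}")
```

In this model, adding zones always improves on a single zone, but the shared failure probability sets a ceiling that no number of zones in the same region can exceed. Going multi-region is what reduces that shared term.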
I recently drove by both the GCP and AWS regions in Oregon. It was so interesting to see one giant facility for GCP, and like 40 separate datacenter buildings for AWS, typically separated by at least half a mile, sometimes tens of miles.
AFAIK if you dig into the details, the different cloud providers have very different concepts of what constitutes an AZ with respect to the types of faults that are isolated.
I always felt a bit scammed by AWS Multi-AZ on RDS, which basically doubles your cost. If their setup is anything like this, I now feel vindicated in turning it off...
AWS, for comparison:
> AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.