by nielsole 1155 days ago
> Water intrusion in europe-west9-a

> We expect general unavailability of the europe-west9 region.

Why would emergency shutdown of a single AZ lead to general unavailability of a region? Isn't that the point of multiple AZs?

> There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage

yikes

From other comments here, it sounds like multiple zones in that region are located in the same datacenter?

If so, that's ... not good.

That’s how GCP does zones: firewalled off, with separate networks/power, in the same physical location.
That's just ridiculous.

AWS, for comparison:

> AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

AWS has similarly suffered outages from an entire datacenter being taken out like this. No one is immune. If you want true fault-tolerance you need to be multi-regional (everyone says as much), ideally, multi-continental.

europe-west9 is the only large Google datacenter in France afaik. Building more would cost lots more money, and it seems like the market isn't there for it. Workloads that require data locality in France are presumably suffering the most. And there are knock-on effects on other datacenters from losing an entire huge chunk of capacity like this.

Eh, source for that? AWS has had issues where a single zone caused such a lack of capacity in the region that some multi-zone services degraded to the point of a domino failover. However, I've not heard of any AWS event where a fire/flood in AZ A also caused a fire/flood in AZ B.
But does it really matter whether the incident is a flood or a cascading software failure if the likelihood and severity are the same?

Being in the same building is an "implementation detail" from a customer perspective, what matters is the consequences of this decision.

For example, maybe this decision allows for better network connectivity at a lower cost for inter-zone traffic while, on the other hand, not protecting against some classes of risk.

In the end, you can have a similar multi-zone outage keeping the region down for an extended period of time just because of a bad network config push (see the massive Facebook outage in 2021). As a customer, I don't care if it's a flood or a network outage.

Imho, what matters most is clear documentation of how these abstractions work for users and the corresponding contractual agreements (costs, SLAs, etc). Users can then decide whether they are ready to pay the price of protecting themselves against an extended outage impacting a single region.

It sounds like this might just be confusion over nomenclature, with Google and Amazon using different terms for the same thing.

Regardless, with GCP, if you need redundancy that can survive the loss of an entire datacenter, then you need to be multi-regional. This has been widely known best practice for a long time.

Are you joking? Please tell me that’s a joke, because there’s no way a cloud provider that big could be that daft.

If that’s true, what’s the fucking point of separating them at all?

Because power / network / software maintenance events cause outages. Those are scheduled per zone, and so they will take down one zone but not a whole data center.
Minimising the blast radius from logical changes (software & config) that get rolled out at an AZ-level.

Their descriptions[0] however promise zones have a "high degree of independence from one another in terms of physical and logical infrastructure". Just how well separated this physical zonal infrastructure was remains to be seen ...

[0] https://cloud.google.com/architecture/disaster-recovery#regi...

Yeah I feel like that description is a lie. Some customers would probably think twice about putting things into the same region if they knew zones weren't physically separated, or go to AWS.
Up-sell.
Ouch. Isn't part of the point of separate zones being protected against something like, say, a terrorist attack or a natural disaster that can take down a whole datacenter?
From https://cloud.google.com/docs/geography-and-regions#regions_...

> Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources provided in one or more physical data centers.
>
> (...)
>
> A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.

You should treat "region" and "zone" as abstract concepts with shared properties like network topology, local peering, costs, and availability. AFAIK no cloud provider discusses, or provides guarantees against, specific threats or correlated failures.

There is no guarantee that a given risk will not impact multiple zones, but this risk is lowered by the implementation of various safeguards (for example, rollouts are not happening in multiple regions at the same time).

Google doesn't say "put your VMs in more than one zone because you can be sure we won't have all zones in a region down at the same time", but rather "by putting your VMs in multiple zones in the same region, you can target better SLOs than the SLOs in one zone".
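To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name `combined_availability`, the 99.9% per-zone figure, and the 0.01% chance of a shared event are made up, not taken from any provider's SLA. It models zones as failing independently, plus one extra "shared" event (say, water in a building that hosts every zone) that takes them all down at once.

```python
def combined_availability(zone_availability: float, zones: int,
                          shared_failure: float = 0.0) -> float:
    """Probability that at least one zone is serving.

    zone_availability: availability of a single zone (hypothetical figure).
    zones: number of zones the workload is spread across.
    shared_failure: probability of an event that takes out every zone at
        once; 0.0 models fully independent zones (an assumption).
    """
    # Chance that every zone happens to be down through independent failures.
    all_down_independently = (1.0 - zone_availability) ** zones
    # The workload is up only if the shared event did not happen AND
    # at least one zone survived its own independent failures.
    return (1.0 - shared_failure) * (1.0 - all_down_independently)


if __name__ == "__main__":
    # Illustrative inputs only, not real SLA numbers.
    print(combined_availability(0.999, 1))
    print(combined_availability(0.999, 3))
    print(combined_availability(0.999, 3, shared_failure=0.0001))
```

Under these assumptions, going multi-zone still beats a single zone, but the shared-event probability caps what you can reach: the result can never exceed `1 - shared_failure`, no matter how many zones you add.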

Note that it's different from the concept of "availability zone" of AWS which explicitly says that AZs are physically separated:

> AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

https://aws.amazon.com/about-aws/global-infrastructure/regio...

I recently drove by both the GCP and AWS regions in Oregon. It was so interesting to see one giant facility for GCP, and like 40 separate datacenter buildings for AWS, typically separated by at least half a mile, sometimes tens of miles.
There are 2 buildings there for Google, serving 3 cloud zones. One of those buildings was Google's first datacenter, so it used some older ideas.

They are actually in the process of building 3 more buildings a ways down the road for more capacity.

AFAIK if you dig into the details, the different cloud providers have very different concepts of what constitutes an AZ with respect to the types of faults that are isolated.
I always felt a bit scammed by AWS Multi-AZ on RDS, which basically doubles your cost. If their setup is anything like this, I now feel vindicated in turning it off...
It isn’t like this. AWS Availability Zones are in separate physical facilities by design, regardless of region.

https://aws.amazon.com/about-aws/global-infrastructure/regio...