by JCM9 1146 days ago
Yes. Azure's and GCP's numbers on the size of their AZs and such are more marketing spin than hard engineering. AWS keeps these in separate physical locations to provide true separation. While there have been tech-related regional incidents at AWS, a physical event disabling multiple AZs would be extremely unlikely given their much more robust and geographically distributed design. If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

Other cloud providers mostly just vaguely put things in another part of the building and say it’s “a separate AZ”, but as GCP's woes highlighted, that’s corner cutting that bites badly when the whole building has a problem.


> If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

In many cases in AWS an availability zone is actually composed of multiple datacenters, each with their own redundancies. This may not be true for smaller regions, but in large ones it definitely is. In those cases, losing an entire datacenter would maybe take out a percentage of instances in that AZ. This has happened before and our production systems barely noticed other than provisioning new nodes to replace the failed health checks.

Googler, opinions are my own.

I think you misunderstand Google's infrastructure. I'm guessing that each GCP zone is actually a Borg Cell (see: https://storage.googleapis.com/pub-tools-public-publication-... ). Borg cells tend to be isolated from each other in many ways in the physical layer (networking and management being a big one, not sure about power). So networking or machine management for an entire zone could go down and not affect other cells. Changes also tend to get pushed on a per-cell basis when they are Google-wide rollouts.

I believe GCP recommends replicating data across regions (https://cloud.google.com/architecture/framework/reliability/...).

Also see: https://cloud.google.com/architecture/disaster-recovery#regi...

I don’t know what you’re trying to say with Borg cells. The point of discussion is not whether the network etc. are separated, but whether the zones are physically separated in such a way that this kind of flooding wouldn’t affect different AZs, and that GCP is cutting corners here.

Obviously every cloud vendor recommends replicating data between multiple regions, but the fact of the matter is that a lot of cloud services are much easier to run with redundancy within a single region than with multi-region redundancy.

I guess it's different types of concerns. My feeling is that Google tries to optimize the resources of a datacenter, and the larger it is, the better things can scale. GCP Zones provide logical separation of machines for management (and network). There may be physical separation, but within a given region, GCP does not advertise this.

I think Google designs their datacenters for their own needs and expects you (a product running in their DCs) to distribute by region. Almost all products at Google will be operating in multiple regions given the reach of most of our services, so DC design followed that need.

Based on GCP's docs, they still think regional separation is better. Not sure why you wouldn't just do that?

If there is a catastrophic event (say, a large tornado hitting AWS us-east-2), those buildings are pretty close to one another and both would likely be taken out, right? So you could lose multiple AZs since they are physically located so close to one another?

Yeah, you’re not getting what people are saying. AWS’s AZs are much more separated than GCP's. Your recommendation that one could build across regions isn’t what folks are talking about here, since there is a big benefit to having geographically separate AZs in the same region. That’s where GCP is falling short.