by gst 1146 days ago
There's an interesting Twitter thread about that topic here: https://twitter.com/GergelyOrosz/status/1651256082424012806

Based on that thread it sounds like only AWS guarantees that their AZs are in physically separate DCs, while for Google and Microsoft AZs could be in separate buildings of the same DC facility.

Yes. Azure and GCP's numbers on the size of their AZs and such are more marketing spin than hard engineering. AWS keeps these in separate physical locations to provide true separation. While there have been tech-related regional incidents at AWS, a physical event disabling multiple AZs would be extremely unlikely given their much more robust and geographically distributed design. If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

Other cloud providers mostly just vaguely put things in another part of the building and say it’s “a separate AZ”, but as GCP’s woes highlighted, that’s corner cutting that bites badly when the whole building has a problem.

> If such a physical event had happened in AWS it would have been a non-event with things just failing over to other AZs.

In many cases in AWS an availability zone is actually composed of multiple datacenters, each with their own redundancies. This may not be true for smaller regions, but in large ones it definitely is. In those cases, losing an entire datacenter would maybe take out a percentage of instances in that AZ. This has happened before and our production systems barely noticed other than provisioning new nodes to replace the failed health checks.
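A rough way to picture the "percentage of instances" point is the sketch below. The layout (3 AZs, 2 datacenters each), the names, and the round-robin placement are all made up for illustration and say nothing about how AWS actually places instances.

```python
def place_round_robin(instance_count, datacenters):
    """Assign instances to (az, datacenter) pairs in round-robin order."""
    return [datacenters[i % len(datacenters)] for i in range(instance_count)]


def surviving_fraction(placements, failed_datacenter):
    """Fraction of all instances still running after one datacenter is lost."""
    survivors = [p for p in placements if p != failed_datacenter]
    return len(survivors) / len(placements)


def az_surviving_fraction(placements, failed_datacenter):
    """Fraction of instances in the affected AZ that survive the loss."""
    az = failed_datacenter[0]
    in_az = [p for p in placements if p[0] == az]
    survivors = [p for p in in_az if p != failed_datacenter]
    return len(survivors) / len(in_az)


# Hypothetical layout: 3 AZs, each backed by 2 datacenters.
DATACENTERS = [
    (az, dc) for az in ("az-a", "az-b", "az-c") for dc in ("dc-1", "dc-2")
]

placements = place_round_robin(12, DATACENTERS)
failed = ("az-a", "dc-1")
fleet_fraction = surviving_fraction(placements, failed)
az_fraction = az_surviving_fraction(placements, failed)
```

With these made-up numbers, losing one datacenter removes part of one AZ's capacity rather than the whole AZ, which is the situation where replacing failed health checks is enough.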

Googler, opinions are my own.

I think you misunderstand Google's infrastructure. I'm guessing that each GCP zone is actually a Borg Cell (see: https://storage.googleapis.com/pub-tools-public-publication-... ). Borg cells tend to be isolated from each other in many ways in the physical layer (networking and management being a big one, not sure about power). So networking or machine management for an entire zone could go down and not affect other cells. Changes also tend to get pushed on a per-cell basis when they are Google-wide rollouts.

I believe GCP recommends to replicate data cross regions (https://cloud.google.com/architecture/framework/reliability/...).

Also see: https://cloud.google.com/architecture/disaster-recovery#regi...

I don’t know what you’re trying to say with Borg cells; the point of discussion is not that the network etc. are separated, but that they’re physically separated in such a way that this kind of flooding wouldn’t affect different AZs, and that GCP is cutting corners here.

Obviously every cloud vendor recommends replicating data between multiple regions, but the fact of the matter is that a lot of cloud services are much easier to run with redundancy within a single region than with multi-region redundancy.

I guess it's different types of concerns. My feeling is that Google tries to optimize the resources of a datacenter, and the larger it is, the better things can scale. GCP Zones provide logical separation of machines for management (and network). There may be physical separation, but within a given region, GCP does not advertise this.

I think Google designs their datacenters for their own needs and expects you (a product running in their DCs) to distribute by region. Almost all products at Google will be operating in multiple regions given the reach of most of our services, so DC design followed that need.

Based on GCP's docs, they still think region separation is better. Not sure why you wouldn't just do that?

If there is a catastrophic event (a large tornado hit AWS us-east-2), those buildings are pretty close to one another and both likely would be taken out, right? So you could lose multiple AZs since they are physically located so close to one another?

Yeah, you’re not getting what people are saying. AWS’s AZs are much more separated than GCP’s. Your recommendation that one could build across regions isn’t what folks are talking about here since there is a big benefit to having geographically separate AZs in the same region. That’s where GCP is falling short here.
AWS treats its availability zones very seriously: each zone has its own independent power substation, air conditioning, and fiber lines.

It's incredibly rare for multiple AZs to go down at once, especially since they are more than a few miles apart from each other.

Funnily enough, floods (GCP) and fires (OVH) are two of the three things AWS explicitly mentions in the Well-Architected docs. For a lot of companies an AZ going down is an annoyance or a bad day, but a whole region going down could be a real continuity risk.

> Each Availability Zone is separated by a meaningful physical distance from other zones to avoid correlated failure scenarios due to environmental hazards like fires, floods, and tornadoes.

https://docs.aws.amazon.com/wellarchitected/latest/reliabili...

> but a whole region going down could be a real continuity risk

Very much so - Australia only got a second region this year, so if your work required data to remain in Australia, you just had to hope that ap-southeast-2 didn't have any major issues. I'm sure there are plenty of other countries with only a single region.

Unless they’re in us-east-1 and it’s an Amazon software/service fault.
This. Don't use us-east-1, it's by far the flakiest. PDX is also a bit rough, but Ohio is golden.
Ohio has tons of problems, no one should ever put their infra in us-east-2 (shhhhhh...don't let the secret out)
And different flood plains. Source: I was at AWS 2008-2014.
It makes it very easy for me (as someone who comes from a world of physical datacentres) to reason about what an AZ is getting me, and also to understand the benefits of using AWS (not having to think about the details of power routing, blade switch vs top-of-rack vs core switch, storage cabling, blah blah blah).

If I have to think too hard and do too much work about how I lay applications out, I might as well just rent in a colo.

It’s more than that. AZs are geographically distinct as well along multiple dimensions, including flood plains etc.
For physical zone separation you need to check the `supportsPzs` attribute when listing the zones (e.g. https://cloud.google.com/compute/docs/reference/rest/v1/zone..., but you should be able to find many other places where this attribute is surfaced).

It says "reserved for future use" but other docs mentioned "physical zone separation": https://googleapis.dev/java/google-api-services-compute/alph...
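For anyone who wants to check this, here is a minimal sketch of filtering a zone listing on that field. Only the `supportsPzs` name comes from the docs linked above; the `items`/`name` response shape is my assumption about the list response, and the sample zones are made up rather than real API output.

```python
def zones_with_physical_separation(zone_list_response):
    """Return names of zones whose `supportsPzs` field is present and true."""
    return [
        zone["name"]
        for zone in zone_list_response.get("items", [])
        # Treat a missing field as "not advertised", i.e. False.
        if zone.get("supportsPzs", False)
    ]


# Hypothetical response data, not fetched from the real API.
sample_response = {
    "items": [
        {"name": "example-zone-a", "supportsPzs": True},
        {"name": "example-zone-b", "supportsPzs": False},
        {"name": "example-zone-c"},  # field absent
    ]
}

pzs_zones = zones_with_physical_separation(sample_response)
```

Since the field is documented as "reserved for future use", a false or missing value probably shouldn't be read as proof that two zones share a building.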

Random datacenters should start advertising availability zones since they should have different fault domains anyway. Google can get away with this, why can't smaller companies?
You would think that the company that literally wrote the book on “Site Reliability Engineering” would actually follow their own recommendations.
Google's advice is not to rely on uptime in every region.

Instead aim for uptime in a few regions, and load balance your users to regions that are healthy.

That design is far cheaper for both google and for you - and, in the typical case, users still get nice low latency to a local datacenter, and only in the rare failure case might they have to wait for latency to some other region.
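As a minimal sketch of that routing idea: send each user to the lowest-latency region that is currently healthy. The region names and latency numbers are invented, and a real load balancer does a lot more than this.

```python
def pick_region(latency_ms_by_region, healthy_regions):
    """Pick the lowest-latency region among those currently healthy."""
    candidates = {
        region: ms
        for region, ms in latency_ms_by_region.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical per-user latency measurements in milliseconds.
latencies = {"region-near": 20, "region-mid": 80, "region-far": 150}

# Typical case: everything is healthy, so the user gets the local region.
normal = pick_region(latencies, {"region-near", "region-mid", "region-far"})

# Rare failure case: the local region is down, so the user pays extra latency.
failover = pick_region(latencies, {"region-mid", "region-far"})
```

The typical case returns the nearest region, and only the failure case falls back to a slower one, which matches the trade-off described above.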

They do internally. But when customers want 3 zones in Indonesia they cut corners.
Do Google host their own products on Google Cloud, or are there different sets of data centres for Search/Drive/Gmail vs Google Cloud Customers?
This is a leased facility, the kind of place Google rents for cloud customers but doesn't need for itself. Google's own datacenters are https://www.google.com/about/datacenters/locations/
Other way around. Google Cloud runs on the same underlying datacenter, compute, and network infrastructure that Search/Drive/Gmail does.

[edit: at least in regions where Google HAS its own datacenters, e.g. "us-central1? yes. europe-west9? maybe not"].

That does not imply that Search / Drive / Gmail runs on top of Google Cloud.

The recommendations are to run in multiple regions if you need this kind of redundancy. Run everything in a single region and you can be affected by an event like this.
It amazes me that in every market they serve, Amazon has no actual competitors from a feature perspective.

Like, Target does not compete with Amazon. They have a totally different home delivery model that is not in the same category of reliability or service.

It's annoying.

I think it's because lots of Amazon's services are in 'winner takes all' markets.

No random online eshop can offer next day delivery across half the world unless they already have a logistics chain of 100,000 truck drivers spread across the world. But Amazon can.

Likewise, no cloud provider has enough data centers to offer multiple separate data centers in the same city, for hundreds of cities around the world. But Amazon does.

No competitor can offer Amazon's level of service until they get to Amazon scale... which they never will.

Or even in the same building, just with a different power/network domain.
Ah I see, I know Azure and GCP in NL are in separate buildings but indeed on the same site. But that's not guaranteed for other regions, good to know.
I would really like to see the physical DC separation at "The Dalles, Oregon".
Looks like there are three buildings[1] to me, not entirely sure what goes where, obviously.

1: https://goo.gl/maps/Tfw5UpSsoYiN3YMVA