by siscia 355 days ago
In fairness, their design does not seem to be regional, with problems in one region bringing down another, apparently unrelated, region.

With this kind of architecture, this sort of problem is just bound to happen.

During my time at AWS, region independence was a must. And some services were able to operate, at least for a while, without degrading even when some core dependencies were not available. Think of losing S3.

And after that, the service would keep operating, but with a degraded experience.

I am stunned that this level of isolation is not common in GCP.


Global dependencies were disallowed back in 2018 with a tiny handful of exceptions that were difficult or impossible to make fully regional. Chemist, the service that went down, was one of those.

Generally GCP wants regionality, but because it offers so many higher-level inter-region features, some kind of a global layer is basically inevitable.

AWS regions are fundamentally different from GCP regions. GCP marketing tries really hard to make it seem otherwise, or to suggest that GCP has all the advantages of AWS regions plus the advantages of its own approach, which leans heavily on "effectively global" services. There are tradeoffs: for example, multi-region in GCP is often trivial and GCP can enforce fairness across regions, but that comes at the cost of availability. Which would be fine, since GCP SLAs reflect the fact that they rarely consider regions to be reliable fault containers, but GCP marketing, IMO, creates a dangerous situation by pretending GCP is something it isn't.

Even in the mini incident report they were going through extreme linguistic gymnastics trying to claim they are regional. Describing the service that caused the outage, which is responsible for global quota enforcement and is configured using a data store that replicates data globally in near real time, with apparently no option to delay replication, they said:

   Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.

Not only would AWS call this a global service, the whole concept of global quotas would not fly at AWS.
How does AWS do that though? Do they re-implement all the code in every region? Because even the slightest re-use of code could trigger a synchronous (possibly delayed) downtime of all regions.
Reusing code doesn't trigger region dependencies.

> Do they re-implement all the code in every region?

Everyone does.

The difference is that AWS very strongly ensures that regions are independent failure domains. The GCP architecture is global, with all the pros and cons that implies. E.g. GCP has a truly global load balancer while AWS cannot, since everything is regional at its core.

They definitely roll out code (at least for some services) one region at a time. That doesn't prevent old bugs/issues from coming up but it definitely helps prevent new ones from becoming global outages.
Right, that makes sense. But if it's an evil bug that only triggers at, e.g., a year change, then that might not help.

So I suppose AWS could theoretically also go down all at once, even if it's less likely.

Regions (and even availability zones) in AWS are independent. The regions all have overlapping IPv4 addresses, so direct cross-region connectivity is impossible.

So it's actually really hard to accidentally make cross-region calls, if you're working inside the AWS infrastructure. The call has to happen over the public Internet, and you need a special approval for that.

Deployments also happen gradually, typically only a few regions at a time. There's an internal tool that allows things to be gradually rolled out and automatically rolled back if monitoring detects that something is off.
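To make the "few regions at a time, roll back if monitoring complains" idea concrete, here is a minimal sketch in Python. It is not the internal tool mentioned above; `staged_rollout` and the `deploy`, `rollback` and `healthy` callbacks are hypothetical names made up for illustration, and a real system would presumably wait for metrics to settle between waves.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    waves: Iterable[List[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one wave of regions at a time (hypothetical sketch).

    If any region in the current wave is unhealthy after deployment,
    roll back every region deployed so far, newest first, and stop.
    Returns the regions left running the new version.
    """
    deployed: List[str] = []
    for wave in waves:
        for region in wave:
            deploy(region)
            deployed.append(region)
        # Stand-in for "monitoring detects that something is off".
        if not all(healthy(region) for region in wave):
            for region in reversed(deployed):
                rollback(region)
            return []
    return deployed
```

The point of the wave structure is that a bad change is caught while it only affects the regions in the current and earlier waves; later waves are never touched.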

Does Route53 depend on services in us-east-1 though? Or maybe it's something else, but I recall us-east-1 downtime causing service downtime for global services.
As far as I remember, Route53 is semi-regional. The master copy is kept in us-east-1, but individual regions have replicated data. So if us-east-1 goes down, the individual regions will keep working with the last known state.

Amazon calls this "static stability".
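A rough sketch of that "keep working with the last known state" behaviour, in Python. This is not how Route53 is actually implemented; `RegionalReplica`, `fetch_from_primary` and the use of `ConnectionError` to signal an unreachable primary are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional


class RegionalReplica:
    """Serves reads from a local copy of data owned by a primary region."""

    def __init__(self, fetch_from_primary: Callable[[], Dict[str, str]]):
        self._fetch = fetch_from_primary  # hypothetical cross-region call
        self._last_known: Dict[str, str] = {}

    def refresh(self) -> bool:
        """Try to pull a fresh snapshot; keep the old one on failure."""
        try:
            snapshot = self._fetch()
        except ConnectionError:
            # Primary unreachable: do nothing, keep serving the old state.
            return False
        self._last_known = dict(snapshot)
        return True

    def resolve(self, name: str) -> Optional[str]:
        """Reads never touch the primary, so they survive its outage."""
        return self._last_known.get(name)
```

The read path has no dependency on the primary at all; only freshness degrades while the primary is down.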

Static stability is a good start, but isn't enough.

In this outage, my service (on GCP) had static stability, which was great. However, some other similar services failed, and we got more load, but we couldn't start additional instances to handle the load because of the outage, and so we had overloaded servers and poor service quality.

Mayhaps we could have adjusted load across regions to manage instance load, but that's not something we normally do.

One of the core pieces of static stability (at least in one definition, it's an overloaded term) is being able to handle failure scenarios from a steady state.

The classic example is overprovisioning so that you can handle the extra zonal load in the event of a zonal outage without needing to scale up.
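The arithmetic behind that is simple enough to sketch in Python. The function name and the numbers are made up for illustration, and it assumes load is spread evenly across the surviving zones.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in every zone so that peak load still fits
    after losing `zones_lost` zones, with no scale-up required."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


# Peak load needs 12 instances; 3 zones; tolerate losing 1 zone.
per_zone = instances_per_zone(12, 3)
print(per_zone, per_zone * 3)  # prints "6 18"
```

So in steady state you pay for 18 instances to serve a load that needs 12, and in exchange a zonal outage requires no calls to the control plane, which may itself be affected by the outage.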