by hurstdog 3527 days ago
Checking specifically for GCS at https://status.cloud.google.com/summary ...

There are two outages there: one that only affected a few projects, and one that only affected service in the central US (we have regions over much of the world).

Anecdotally, from watching and working with other services internally I don't think most of our outages affect all regions. We actually spend a significant amount of engineering effort ensuring that we're as decoupled as possible.

Disclaimer: I'm an Engineering Manager on Google Cloud Storage.

2 comments

On the flip side, many, many outages lately have hit an entire region, making the availability zone bit not so useful. For example, the last US Central load balancer issue took out the entire central region for anyone using it.

Not sure what happened there, but I think a deploy should be rolled out to one availability zone at a time for hosted things (like the load balancer).

Is there some documentation anywhere that talks about how Google Cloud (any product) creates isolation between various regions while at the same time exposing a simple "regionless" programming model?

Probably the best reference is the SRE book, specifically the chapters on load balancing and distributed consensus protocols.

Other than that, the general approach is to minimize global control planes and dependencies in our software stack. In the case of GCS, we do have a single namespace which means we need to look up the locations of data early in the request. Once we know the locations of data we can route the request to the right datacenter to serve it. That global location table is highly replicated and cached, of course.
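
To make the lookup-then-route flow above concrete, here is a minimal sketch in Python. It is not GCS code: `LocationTable`, `Frontend`, the object names, and the datacenter names are all hypothetical, and the replica choice is just "serve locally if possible, otherwise use the first listed location".

```python
from dataclasses import dataclass, field


@dataclass
class LocationTable:
    """Hypothetical stand-in for a replicated global object-location table."""
    entries: dict[str, list[str]]  # object name -> datacenters holding it

    def lookup(self, name: str) -> list[str]:
        return self.entries[name]


@dataclass
class Frontend:
    """Hypothetical request frontend running in one datacenter."""
    table: LocationTable
    local_dc: str
    cache: dict[str, list[str]] = field(default_factory=dict)

    def route(self, name: str) -> str:
        # Resolve the data's locations early in the request, preferring the cache.
        if name not in self.cache:
            self.cache[name] = self.table.lookup(name)
        locations = self.cache[name]
        # Serve locally when possible; otherwise forward to the first listed replica.
        return self.local_dc if self.local_dc in locations else locations[0]


# Illustrative data only.
table = LocationTable({
    "bucket/a": ["us-central1", "europe-west1"],
    "bucket/b": ["asia-east1"],
})
frontend = Frontend(table, local_dc="europe-west1")
```

With this setup, `frontend.route("bucket/a")` stays in the local datacenter while `frontend.route("bucket/b")` is forwarded elsewhere. The only shared dependency in the sketch is the location table, which is the point of the comment above: everything after the lookup is local to whichever datacenter serves the request.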

When outages happen, most are caused by changes to the stack, so we also roll out code and configuration slowly and carefully, widening the blast radius only after each step has proven safe. For example, we roll out new binaries first to a few canary instances in one zone, then to a few instances in many regions, then to a full region, then to the world, all spread over a few days.
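
As a rough illustration of that staged rollout, here is a minimal Python sketch. It assumes hypothetical `deploy` and `healthy` callbacks supplied by the caller, and the stage names simply mirror the order described above; it is not how any Google tooling actually works.

```python
from typing import Callable

# Hypothetical stages, in the order described above.
STAGES = [
    "canary instances in one zone",
    "a few instances in many regions",
    "one full region",
    "the world",
]


def staged_rollout(
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns the stages that were deployed and passed the health check.
    """
    completed = []
    for stage in STAGES:
        deploy(stage)
        if not healthy(stage):
            # Stop widening the blast radius; rollback is left to the caller.
            break
        completed.append(stage)
    return completed
```

A real rollout would also wait between stages (the "spread over a few days" part) so that slow-burning problems have time to show up before the next, larger stage begins.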

Disclaimer: I'm an Engineering Manager on Google Cloud Storage.