by jonathanoliver 3531 days ago
Is there any documentation that talks about how Google Cloud (any product) creates isolation between regions while at the same time exposing a simple "regionless" programming model?
1 comment

Probably the best reference is the SRE book, specifically the chapters on load balancing and distributed consensus protocols.

Other than that, the general approach is to minimize global control planes and dependencies in our software stack. In the case of GCS, we do have a single namespace, which means we need to look up the locations of data early in the request. Once we know where the data lives, we can route the request to the right datacenter to serve it. That global location table is highly replicated and cached, of course.
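
To make the lookup-then-route idea concrete, here is a minimal sketch. It is a toy model, not how GCS is actually built: `LocationTable`, `Frontend`, the bucket name, and the datacenter names are all hypothetical, and replication of the table is reduced to a simple per-frontend cache.

```python
from dataclasses import dataclass, field


@dataclass
class LocationTable:
    """Hypothetical global table: bucket name -> datacenters holding its data."""
    entries: dict[str, list[str]]
    lookups: int = 0  # counts reads that actually reach the table

    def locate(self, bucket: str) -> list[str]:
        self.lookups += 1
        return self.entries[bucket]


@dataclass
class Frontend:
    """Hypothetical request frontend with a local cache in front of the table."""
    table: LocationTable
    cache: dict[str, list[str]] = field(default_factory=dict)

    def route(self, bucket: str, healthy: set[str]) -> str:
        # Resolve locations early in the request, preferring the local cache.
        if bucket not in self.cache:
            self.cache[bucket] = self.table.locate(bucket)
        # Send the request to the first healthy datacenter that holds the data.
        for dc in self.cache[bucket]:
            if dc in healthy:
                return dc
        raise RuntimeError(f"no healthy datacenter holds {bucket!r}")


table = LocationTable({"photos": ["us-east", "eu-west"]})
frontend = Frontend(table)
first = frontend.route("photos", healthy={"us-east", "eu-west"})
second = frontend.route("photos", healthy={"eu-west"})
```

In this sketch the second request is answered from the frontend's cache, so the global table is consulted only once, and losing one datacenter only changes where the request lands.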

Most outages are caused by changes to the stack, so we are also careful to roll out code and configuration slowly, widening the blast radius only after each step has proven safe. For example, we roll out new binaries first to a few canary instances in one zone, then to a few instances in many regions, then to a full region, then to the world, all spread over a few days.
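
The staged rollout can be sketched roughly as below. This is only an illustration of the ordering: the stage names are paraphrased from the description above, `is_healthy` is a hypothetical stand-in for whatever monitoring decides a stage is safe, and the waiting time between stages is left out.

```python
from typing import Callable

# Hypothetical stages; each one widens the blast radius of the change.
STAGES = [
    "canaries in one zone",
    "a few instances in many regions",
    "one full region",
    "the world",
]


def staged_rollout(is_healthy: Callable[[str], bool]) -> list[str]:
    """Advance stage by stage and stop at the first stage that fails its check.

    Returns the stages that completed successfully.
    """
    completed = []
    for stage in STAGES:
        if not is_healthy(stage):
            break  # halt here so the problem never reaches wider stages
        completed.append(stage)
    return completed


all_ok = staged_rollout(lambda stage: True)
stopped = staged_rollout(lambda stage: stage != "one full region")
```

Here `stopped` contains only the first two stages, because the failing check at the full-region stage halts the rollout before it reaches the world.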

Disclaimer: I'm an Engineering Manager on Google Cloud Storage.