by bushbaba 1207 days ago
This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti pattern.

gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

11 comments

You can't really have 30+ fully independent regions running their own stack with different versions of apps and separate secrets, IP/routing and certificates in each. At some point you have to unify or it becomes either unmanageable or inconsistent.
Right. You want regions to be fully independent, yet the software stacks they are running to be fully synchronized and consistent. So there’s a tension. If there’s a sleeper bug that wakes only after it has been rolled out to every region, you’ve got a global outage. Given the increasing complexity of these systems, it will never be possible to find all of those.
Most of GCP’s customers can’t, but independent regions are one of the benefits that a well architected cloud provider can give you to build on.
do you mean the cloud provider can't, or the customer can't?
But you can have 3. Why did you choose 30?

In my company we are split in 3 (US, EU, APAC), and we have the same issue with global outages for stuff we could have just managed regionally. All the savings of the global architecture disappear with each minute a client is down in a global outage because a guy thousands of km away messed up.

You don't have to unify at all. You don't unify with your competitors, and the world has not exploded. Why not compete internally between regions?

GCP needs to support 30 regions because... they're a cloud provider.
Then they can do a glocal model with 10 regions grouped into 3 semi-global groups, so when there's a global outage, it can only be in one of these?
How does this fit in with upcoming EU data sovereignty laws?
The underlying problem is that Google doesn't operate the world's DNS servers, but still wants to offer the best possible user experience as a global service. This means anycast VIP routing, because not all DNS servers implement EDNS, but they want to have SSL connections terminate as close to users as possible.

As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?

How is new config going to go out, globally, without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?

People make fun of "webscale" but operating Google is really difficult and complicated!

AWS US east 1 had significant downtime last year so I'm not sure what you're trying to say with that link. Would you mind expanding on your thoughts?
One region failing (especially us-east-1) is common, but it's very rare to see an AWS global outage.
This. us-east-1 is the oldest region IIRC and it has its share of issues. Back when I used to work mostly on AWS, zonal outages used to happen once in a while, but outages of entire regions were rare, never mind global outages.

The global outage thing seems to be a consistent "feature" of GCP. How are we supposed to architect our deployments for high availability if the regional isolation model is not a bulwark against outages on GCP?

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.

They had many, many global outages through the years, so that’s evidently not true. GCLB, IAM, GCS, and probably more I’m missing, just off the top of my head. Then there’s a constant stream of regional networking borks where your latency is suddenly 5x, which are not “global” but affect multiple regions.
Anecdata, but in my experience Google Cloud has been MUCH more solid than my time spent on AWS.
While a fair point, it's in no way a counterargument to what the person above was saying. Having fewer outages is not the same as having no global outages.
Historically that has not been my experience at all, though to be fair GCP has cleaned up their act substantially in the past 1-2 years.
They had lots of global outages in past years, but in recent years they have become increasingly rare, presumably because of a move away from global points of failure
My knowledge level: can use AWS console to do < 5% of what is possible.

How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?

Most of it is cellular or regional, but there are a few critical global services. The global network load balancing, network QoS, and DDoS prevention are more functional because they are global (i.e. you couldn't replace them with equivalent regional versions), but are often causes of issues like this. There was a push a few years ago to ensure global services had at least 99.999% uptime or make them regional. This was a 48-minute outage, so it blows that five-nines budget for about 9 years.
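
For anyone who wants to check the "9 years" figure, here is a minimal sketch of the arithmetic. It assumes a 365-day year and that the whole 48 minutes counts against the budget; the function names are made up for illustration.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability)


budget = downtime_budget_minutes(0.99999)        # five nines
years = years_of_budget_consumed(48, 0.99999)    # the 48-minute outage
print(f"{budget:.2f} min/year, {years:.1f} years")
```

Five nines allows a little over 5 minutes of downtime per year, so a single 48-minute outage uses up roughly nine years' worth.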

Ex-googler, no particular knowledge of this event, information might be out of date.

The pattern for past large google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because that change is quoted as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before

Often "impossible" is based on constraints like "0 downtime", "100% planned rollout, rollback scenarios", etc.

These constraints get thrown to the wind when the downtime is already happening.

I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example), they just don't until the shit hits the fan.
Assuming good automation, most of the work comes in being able to do a second of something instead of just having one. The difference in work between “single point” and “multiple point” is a lot, but increasing the multiple points beyond that isn’t too bad.

Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline it doesn’t matter too much. Easy to break everything simultaneously that way if there’s some difference between test and prod you didn’t realize was there.
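
To make the "all at once" point concrete, here is a rough sketch of the alternative: push a change one group of regions at a time and stop as soon as a group looks unhealthy. Everything here is hypothetical (the wave names, `staged_rollout`, and the `deploy`/`healthy` callbacks); it is not how any particular provider's pipeline works.

```python
from typing import Callable, Dict, List

# Hypothetical rollout waves; region names are illustrative only.
WAVES: List[List[str]] = [
    ["canary-1"],
    ["us-a", "us-b"],
    ["eu-a", "eu-b", "apac-a"],
]


def staged_rollout(
    waves: List[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Deploy wave by wave; halt at the first wave with an unhealthy region."""
    deployed: List[str] = []
    for i, wave in enumerate(waves):
        for region in wave:
            deploy(region)
            deployed.append(region)
        failed = [r for r in wave if not healthy(r)]
        if failed:
            # Later waves never receive the change, which bounds the blast radius.
            untouched = [r for later in waves[i + 1:] for r in later]
            return {"deployed": deployed, "failed": failed, "untouched": untouched}
    return {"deployed": deployed, "failed": [], "untouched": []}


# Simulate a change that only breaks in "us-b".
touched = set()
result = staged_rollout(WAVES, touched.add, lambda region: region != "us-b")
print(result)
```

In this simulated run the third wave never sees the bad change. Note that it only helps if the breakage shows up before the next wave starts; a bug that stays dormant until every region has the change gets through regardless.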

If you get into the nitty gritty of it, it doesn't really make sense. Are you going to have 5 different load balancer software stacks, with 5 different config file languages, causing each client (say Gmail) to have to implement their config 5 different ways? That's insane.
My biggest AWS surprise bill (so far!) was due to a bug in AWS console region switching.
> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.

From the messaging, this seems like a partial network outage.

Of course, at Google scale 'partial' is still very big.

> This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti pattern.

And why enterprises clamoring for AWS to feature match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (making I.T. actually more resilient, without extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, etc.

Also, AWS, if you're in this thread, stop with the recent cross-region coupling features already. Google's doing it wrong, explain that, and be patient, the market share will come back to you when they run out of the GCP subsidy dollars.

You really want to go through every region to find what VMs are running? Why can this not be a single page with all VMs listed?
The AWS console already has a single page where you can see how many EC2/networking resources you have in every region.

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily copied to regular instances in other cloud providers, that possibly wrap some cloud-specific feature themselves.
But....cost.
Reliable, cheap, or ....? Pick N-1.
For reference / comparison, how many regional outages have there been? Did service outages get avoided due to running a workload in multiple regions?
Because copypasting from A to B is much safer...
There are three things that scare Google engineers enough to keep them up at night: a global network outage, a global power outage, and a global Chubby outage. Actually, they only really worry about that last one.