| HN Mirror


	by bushbaba 1207 days ago
	This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti pattern. gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

11 comments

tgtweak 1207 days ago

You can't really have 30+ fully independent regions running their own stack with different versions of apps and separate secrets, IP/routing and certificates in each. At some point you have to unify or it becomes either unmanageable or inconsistent.

markstos 1207 days ago

Right. You want regions to be fully independent, yet the software stacks they are running to be fully synchronized and consistent. So there’s a tension. If there’s a sleeper bug that wakes only after it has been rolled out to every region, you’ve got a global outage. Given the increasing complexity of these systems, it will never be possible to find all of those bugs in advance.

skywhopper 1207 days ago

Most of GCP’s customers can’t, but independent regions are one of the benefits that a well architected cloud provider can give you to build on.

dastbe 1207 days ago

do you mean the cloud provider can't, or the customer can't?

xwolfi 1207 days ago

But you can have 3. Why did you choose 30?

In my company we are split in 3, US, EU, APAC, and we have the same issue with global outage for stuff we could have just managed regionally. For all the savings of the global architecture, they disappear each minute a client is down on a global outage because a guy thousands of kms away messed up.

You don't have to unify, at all. You don't unify with your competitors, and the world has not exploded: compete internally between regions?

shepherdjerred 1206 days ago

GCP needs to support 30 regions because... they're a cloud provider.

xwolfi 1206 days ago

Then they can do a glocal model with 10 regions grouped into 3 semi-global groups, so when there's a global outage, it can only be in one of these?

esperent 1206 days ago

How does this fit in with upcoming EU data sovereignty laws?

zamnos 1207 days ago

The underlying problem is that Google doesn't operate the world's DNS servers, but still wants to offer the best possible user experience as a global service. This means anycast VIP routing, because not all DNS servers implement EDNS, but they want to have SSL connections terminate as closely to users as possible.

As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?

How is new config going to go out, globally, without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?

People make fun of "webscale" but operating Google is really difficult and complicated!

otterley 1207 days ago

https://aws.amazon.com/builders-library/automating-safe-hand...

zamnos 1207 days ago

AWS US east 1 had significant downtime last year so I'm not sure what you're trying to say with that link. Would you mind expanding on your thoughts?

shepherdjerred 1206 days ago

One region failing (especially us-east-1) is common, but it's very rare to see an AWS global outage.

talonx 1206 days ago

This. us-east-1 is the oldest region IIRC and it has its share of issues. Back when I used to work mostly on AWS, zonal outages used to happen once in a while, but entire regions going down was rare, forget global outages.

The global outage thing seems to be a consistent "feature" of GCP - how are we supposed to architect our deployments if the regional isolation model is not a bulwark against outages on GCP?

medler 1207 days ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.

dilyevsky 1207 days ago

They had many, many global outages through the years, so that’s evidently not true. GCLB, IAM, GCS, and probably more I’m missing, just off the top of my head. Then there’s a constant stream of regional networking borks where your latency is suddenly 5x, which are not “global” but affect multiple regions.

throitallaway 1207 days ago

Anecdata, but in my experience Google Cloud has been MUCH more solid than my time spent on AWS.

AlecSchueler 1207 days ago

While a fair point, it's in no way a counterargument to what the person above was saying. Having fewer outages is not the same as having no global outages.

dilyevsky 1207 days ago

Historically that has not been my experience at all tho tbf gcp has cleaned up their act substantially in the past 1-2 years

medler 1206 days ago

They had lots of global outages in past years, but in recent years they have become increasingly rare, presumably because of a move away from global points of failure

consumer451 1207 days ago

My knowledge level: can use AWS console to do < 5% of what is possible.

How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?

singron 1207 days ago

Most of it is cellular or regional, but there are a few critical global services. The global network load balancing, network qos, and ddos prevention are more functional because they are global (i.e. you couldn't replace them with equivalent regional versions), but are often causes of issues like this. There was a push a few years ago to ensure global services had at least 99.999% uptime or make them regional. This was a 48 minute outage, so it blows that five 9 budget for 9 years.

Ex-googler, no particular knowledge of this event, information might be out of date.
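
The five-nines arithmetic in this comment can be checked with a quick sketch. It takes the 99.999% target and the 48-minute duration from the comment as given and assumes a 365.25-day year; the function names are made up for illustration.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability)


budget = downtime_budget_minutes(0.99999)      # five nines
years = years_of_budget_consumed(48, 0.99999)  # the 48-minute outage
print(f"{budget:.2f} min/year, {years:.1f} years")
# prints "5.26 min/year, 9.1 years"
```

At roughly 5.26 minutes of allowed downtime per year, a 48-minute outage uses a little over nine years of budget, which matches the "9 years" figure.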

pclmulqdq 1207 days ago

The pattern for past large google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because that change is quoted as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before

richthegeek 1207 days ago

Often "impossible" is based on constraints like "0 downtime" "100% planned rollout, rollback scenarios" etc.

These constraints get thrown to the wind when the downtime is already happening.

pclmulqdq 1207 days ago

I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example), they just don't until the shit hits the fan.

NineStarPoint 1207 days ago

Assuming good automation, most of the work comes in being able to do a second of something instead of just having one. The difference in work between “single point” and “multiple point” is a lot, but increasing the multiple points beyond that isn’t too bad.

Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline it doesn’t matter too much. Easy to break everything simultaneously that way if there’s some difference between test and prod you didn’t realize was there.
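
A minimal sketch of the alternative to pushing to every stack at once: roll out in waves and halt when a wave looks unhealthy. This is not how GCP or AWS deploys; the region names, the `deploy` callback, and the `healthy` check are all hypothetical.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    waves: Iterable[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy wave by wave; stop after the first wave with an unhealthy region.

    Returns the regions that received the change.
    """
    touched: List[str] = []
    for wave in waves:
        for region in wave:
            deploy(region)
            touched.append(region)
        if not all(healthy(region) for region in wave):
            break  # halt: later waves never receive the bad change
    return touched


# Hypothetical waves: a canary first, then progressively larger groups.
waves = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "apac-1"]]
deployed = set()
result = staged_rollout(
    waves,
    deploy=deployed.add,
    healthy=lambda region: region != "eu-2",  # pretend the change breaks eu-2
)
print(result)  # ['canary-1', 'eu-1', 'eu-2']
```

The third wave is never touched, so the damage stays in the second wave. As noted earlier in the thread, this does not help with a sleeper bug that only wakes after every wave has passed its health check.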

zamnos 1207 days ago

If you get into the nitty gritty of it, it doesn't really make sense. Are you going to have 5 different load balancer software stacks, with 5 different config file languages, causing each client (say Gmail) to have to implement their config 5 different ways? That's insane.

jjoonathan 1207 days ago

My biggest AWS surprise bill (so far!) was due to a bug in AWS console region switching.

kevinventullo 1207 days ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.

uniformlyrandom 1207 days ago

From the messaging, this seems like a partial network outage.

Of course, at Google scale 'partial' is still very big.

Terretta 1207 days ago

> This demonstrates yet again why global configurations, global services, and global anycast VIP routing should be considered an anti pattern.

And why enterprises clamoring for AWS to feature match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (making I.T. actually more resilient, without extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, etc.

Also, AWS, if you're in this thread, stop with the recent cross-region coupling features already. Google's doing it wrong, explain that, and be patient, the market share will come back to you when they run out of the GCP subsidy dollars.

verdverm 1207 days ago

You really want to go through every region to find what VMs are running? Why can this not be a single page with all VMs listed?

shitlord 1207 days ago

The AWS console already has a single page where you can see how many EC2/networking resources you have in every region.

dotancohen 1207 days ago

> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.

If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily copied to regular instances in other cloud providers, that possibly wrap some cloud-specific feature themselves.

talonx 1206 days ago

But....cost.

dotancohen 1206 days ago

Reliable, cheap, or ....? Pick N-1.

Cthulhu_ 1207 days ago

For reference / comparison, how many regional outages have there been? Did service outages get avoided due to running a workload in multiple regions?

papruapap 1207 days ago

Because copypasting from A to B is much safer...

dekhn 1207 days ago

There are three things that scare Google engineers enough to keep them up at night: a global network outage, a global power outage, and a global Chubby outage. Actually, they only really worry about that last one.
