Google cloud outage

	Google cloud outage (status.cloud.google.com)
	74 points by thomassharoon 2275 days ago

6 comments

qmarchi 2275 days ago

Heyo, Googler here.

The problem was a mix between another cloud provider and GCP.

Dare I say, there should be little customer impact as of 13:37 PST.....

The status dashboard is going to be your best source of information.

svacko 2275 days ago

Is the other cloud provider AWS? I could see tons of connection timeouts between GCP and the S3/Elasticsearch services.

Hope everything is resolved now for good.

judge2020 2275 days ago

Seems to be AWS; connections to Gmail's SMTP relay also started timing out.

gigatexal 2274 days ago

Oh man I had no idea the big cloud providers have dependencies on other clouds like this.

lima 2274 days ago

They do not. According to the dashboard, this issue merely affected connectivity between GCP and other cloud providers.

There was a different outage yesterday, which has nothing to do with the one discussed in this thread.

dodobirdlord 2274 days ago

Given how much trans-continental/trans-oceanic network cable the major cloud providers own, they almost certainly have special trans-cloud network traffic infrastructure, especially since so much of "The Cloud" is within a few tens of square miles in a field in Virginia. I can easily see how one provider could majorly disrupt another provider by accidentally breaking inbound traffic on one of those links.

qmarchi 2274 days ago

The bigger issue is that a lot of customers have split-cloud deployments, which means those customers get hurt even if each cloud is stable internally.

thedance 2274 days ago

If you are deployed in such a way that both GCP and AWS need to be up you're doing it backwards. Multi-cloud strategy is supposed to result in the intersection of cloud failures, not the union of them.
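
A minimal sketch of the arithmetic behind "intersection, not union", assuming the two providers fail independently (real outages can be correlated, as this thread shows) and using made-up availability figures. The function names and the 0.999 values are hypothetical, chosen only for illustration.

```python
def availability_all_required(*availabilities):
    """Service needs EVERY provider up: it is down during the union of outages."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def availability_any_sufficient(*availabilities):
    """Service needs ANY one provider up: it is down only during the
    intersection of outages (all providers down at the same time)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical per-provider availabilities, not measured figures.
gcp = 0.999
aws = 0.999

print("both required :", availability_all_required(gcp, aws))
print("either is fine:", availability_any_sufficient(gcp, aws))
```

Under these assumptions, a deployment that needs both clouds is less available than either cloud alone, while one that can run on either is more available than both.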

uluyol 2274 days ago

I have heard that many companies are multi-cloud as a result of acquisitions, resulting in a dependency on both clouds.

qmarchi 2274 days ago

"But all of our problems are fixed by going to the cloud!"

gigatexal 2274 days ago

Yeah, I see that now. Makes total sense.

the-dude 2275 days ago

This can't be real.

[removed]

We were seeing timeouts in east-1. I don't know what "normal" looks like, but Pingdom's map seems to show the whole east coast as affected https://livemap.pingdom.com/

svacko 2275 days ago

yeah, our GKE pods running in us-east1 were dying like crazy ~90 minutes ago... hope they are gonna resolve this soon. not the luckiest day for Google, nor for us

x__x 2274 days ago

I was bummed out when Siteground moved all their cloud accounts over to Google without telling their customers beforehand.

kgraves 2274 days ago

This is extremely concerning as somebody looking to move or build on top of GCP for the long term. I wonder why anyone would choose GCP if outages are occurring on a regular basis.

pgodzin 2274 days ago

Any evidence they happen more frequently than on the other clouds?

tagux 2275 days ago

"We had a router failure in Atlanta".

WHAT? You kidding us?

Urs Hölzle, senior vice president of technical infrastructure at Google Cloud, said, "We're very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: This wasn't related to traffic levels or any kind of overload, our network is not stressed by COVID-19."

thedance 2274 days ago

Wrong outage.

neonate 2275 days ago

https://twitter.com/uhoelzle/status/1243217659690278912

ocdtrekkie 2275 days ago

Was it like... a hardware failure? If you serve more than 100 people you probably should have redundant routers. Was it a configuration issue that replicated over to multiple devices at least, I hope?

toast0 2274 days ago

Have you worked with redundant routers? They certainly reduce the number of outages, but sometimes the hardware (or software) fails in exciting ways that don't engage the redundancy, or don't engage it properly, and you still get an outage (or you get an outage that wouldn't have happened otherwise). Or sometimes, one circuit is out of service for repair or upgrade, and the other circuit is connected to the router that failed. And routing for the AS that travels on that circuit was set not to fall back to transit because the last time that happened, it caused major issues.

I have no specific knowledge of today's events, but this sort of thing happens. You can get the number of incidents down pretty low, but not to zero.
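
A toy sketch of why redundancy sometimes fails to engage, assuming failover is driven purely by a heartbeat signal. Everything here (the `Router` class, its fields, the three-router setup) is hypothetical and is not a description of Google's network: a router that dies cleanly stops heartbeating and is pulled from rotation, while one that keeps heartbeating but drops traffic stays in rotation.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    heartbeat_ok: bool = True      # what the failover logic can observe
    forwards_traffic: bool = True  # what customers actually experience


def active_set(routers):
    """Failover logic: keep only routers whose heartbeat looks healthy."""
    return [r for r in routers if r.heartbeat_ok]


def delivered_fraction(routers):
    """Fraction of traffic delivered if load is spread evenly over the active set."""
    active = active_set(routers)
    if not active:
        return 0.0
    return sum(r.forwards_traffic for r in active) / len(active)


# Clean failure: r1 loses power, the heartbeat stops, and traffic shifts away.
clean = [Router("r1", heartbeat_ok=False, forwards_traffic=False),
         Router("r2"), Router("r3")]

# Partial failure: r1 still answers heartbeats but silently drops packets.
partial = [Router("r1", heartbeat_ok=True, forwards_traffic=False),
           Router("r2"), Router("r3")]

print("clean failure  :", delivered_fraction(clean))
print("partial failure:", delivered_fraction(partial))
```

In the second case the redundant routers are healthy but never take over, so someone has to notice and isolate the faulty device by hand.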

nineteen999 2274 days ago

I remember seeing a Security Analyst for the DC I worked for take down 6 racks worth of Cisco Catalyst 12000 series router hardware once.

They had an HSRP interface set up at the .1 address, and the security analyst set his laptop up with the same static .1 IP address and plugged it in. Instant outage.

ocdtrekkie 2274 days ago

I have. I am just highlighting that the problem surely should be more complex than described. Or that their redundancy for these events was not adequately devised.

toast0 2274 days ago

Google often releases a pretty solid post-mortem, which will give the details of the event. The level of detail appropriate for same-day release is really 'router failure' or 'power failure' or 'software failure' or 'vehicle drove into the building failure'. Expecting more than 'we know what it was, and we fixed it' or 'we don't know what it was, but it stopped happening' or 'yes, we're working on it' in a same-day Twitter post is unreasonable.

packetslave 2274 days ago

yes, because OBVIOUSLY Google is too stupid to know about redundant routers. /s

https://twitter.com/uhoelzle/status/1243259280410554368

"When routers fail cleanly (say, power out) failover is quick, so you never hear about these. This wasn't such a simple case. We have "many" (not just two) routers in Atlanta so it wasn't an issue of missing redundancy."

dodobirdlord 2274 days ago

Networks are harder than everyone thinks. The 2018 CenturyLink outage on the west coast was caused by 1 bad network card that started writing malformed packets.

https://www.geekwire.com/2018/report-huge-centurylink-outage...

AdamJacobMuller 2274 days ago

It's not that simple, as you sometimes need to manually isolate the faulty hardware and remove it from service.

thanksforfish 2274 days ago

Surely the "100 people" metric is too low, although I agree that at some point (and certainly at Google scale) a redundant router makes sense.
