Hacker News
Google cloud outage (status.cloud.google.com)
74 points by thomassharoon 2275 days ago
6 comments

Heyo Googler here.

The problem was a mix between another cloud provider and GCP.

Dare I say, there should be little customer impact as of 13:37 PST.....

The status dashboard is going to be your best source of information.

Is the other cloud provider AWS? I could see tons of connection timeouts between GCP and S3/Elasticsearch Service.

Hope everything is resolved now for good.

Seems to be AWS; connections to Gmail's SMTP relay also started timing out.
Oh man I had no idea the big cloud providers have dependencies on other clouds like this.
They do not. According to the dashboard, this issue merely affected connectivity between GCP and other cloud providers.

There was a different outage yesterday, which has nothing to do with the one discussed in this thread.

Given how much trans-continental/trans-oceanic network cable the major cloud providers own, they almost certainly have special trans-cloud network traffic infrastructure. Especially since so much of "The Cloud" is within a few 10s of square miles in a field in Virginia. I can easily see how one provider could majorly disrupt another provider by accidentally breaking inbound traffic on one of those links.
The bigger issue is that a lot of customers have split-cloud deployments, which means those customers are hurt even if each cloud is stable internally.
If you are deployed in such a way that both GCP and AWS need to be up, you're doing it backwards. Multi-cloud strategy is supposed to result in the intersection of cloud failures, not the union of them.
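To put rough numbers on the intersection-versus-union point, here is a minimal sketch. It assumes provider failures are independent, which real outages often are not, and the 99.9% availability figures are made-up placeholders rather than anyone's actual SLA. The function names are hypothetical.

```python
def availability_all_required(*availabilities):
    """Service is up only if EVERY provider is up.

    Downtime is the union of the providers' failures.
    Assumes failures are independent.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def availability_any_sufficient(*availabilities):
    """Service is up if AT LEAST ONE provider is up.

    Downtime is the intersection of the providers' failures.
    Assumes failures are independent.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical availabilities, not real SLA numbers.
    gcp = 0.999
    aws = 0.999
    print("both required: ", availability_all_required(gcp, aws))
    print("either is enough:", availability_any_sufficient(gcp, aws))
```

Under these assumptions, requiring both clouds gives lower availability than either cloud alone, while being able to run on either one gives higher availability than either alone.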
I have heard that many companies are multi-cloud as a result of acquisitions, which leaves them dependent on both clouds.
"But all of our problems are fixed by going to the cloud!"
Yeah, I see that now. Makes total sense.
This can't be real.
[removed]
We were seeing timeouts in east-1. I don't know what "normal" looks like, but Pingdom's map seems to show the whole east coast as affected https://livemap.pingdom.com/
Yeah, our GKE pods running in us-east1 were dying like crazy ~90 minutes ago. Hope they resolve this soon. Not the luckiest day for Google, nor for us.
I was bummed out when SiteGround moved all their cloud accounts over to Google without telling their customers beforehand.
This is extremely concerning as somebody looking to move or build on top of GCP for the long term. I wonder why anyone would choose GCP if outages are occurring on a regular basis.
Any evidence they happen more frequently than on the other clouds?
"We had a router failure in Atlanta".

WHAT? You kidding us?

Urs Hölzle, senior vice president of technical infrastructure at Google Cloud, said, "We're very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: This wasn't related to traffic levels or any kind of overload, our network is not stressed by COVID-19."

Wrong outage.
Was it like... a hardware failure? If you serve more than 100 people you probably should have redundant routers. Was it a configuration issue that replicated over to multiple devices at least, I hope?
Have you worked with redundant routers? They certainly reduce the number of outages, but sometimes the hardware (or software) fails in exciting ways that don't engage the redundancy, or don't engage it properly, and you still get an outage (or you get an outage that wouldn't have happened otherwise). Or sometimes one circuit is out of service for repair or upgrade, and the other circuit is connected to the router that failed. And routing for the AS that travels on that circuit was set not to fall back to transit, because the last time that happened it caused major issues.

I have no specific knowledge of today's events, but this sort of thing happens. You can get the number of incidents down pretty low, but not to zero.

I remember seeing a Security Analyst for the DC I worked for take down 6 racks worth of Cisco Catalyst 12000 series router hardware once.

They had an HSRP interface set up at the .1 address, and the security analyst set his laptop up with the same static .1 IP address and plugged it in. Instant outage.

I have. I am just pointing out that the problem must surely be more complex than described, or else their redundancy for these events was not adequately designed.
Google often releases a pretty solid post-mortem, which will give the detail of the event. The level of detail appropriate for same-day release is really 'router failure' or 'power failure' or 'software failure' or 'vehicle drove into the building failure'. Expecting more than 'we know what it was, and we fixed it' or 'we don't know what it was, but it stopped happening' or 'yes, we're working on it' on a same-day twitter post is unreasonable.
yes, because OBVIOUSLY Google is too stupid to know about redundant routers. /s

https://twitter.com/uhoelzle/status/1243259280410554368

"When routers fail cleanly (say, power out) failover is quick, so you never hear about these. This wasn't such a simple case. We have "many" (not just two) routers in Atlanta so it wasn't an issue of missing redundancy."

Networks are harder than everyone thinks. The 2018 CenturyLink outage on the west coast was caused by 1 bad network card that started writing malformed packets.

https://www.geekwire.com/2018/report-huge-centurylink-outage...

It's not that simple, as you sometimes need to manually isolate the faulty hardware and remove it from service.
Surely the "100 people" metric is too low, although I agree that at some point (and certainly at Google scale) a redundant router makes sense.