by tagux 2275 days ago
"We had a router failure in Atlanta".

WHAT? You kidding us?

Urs Hölzle, senior vice president of technical infrastructure at Google Cloud, said, "We're very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: This wasn't related to traffic levels or any kind of overload, our network is not stressed by COVID-19."

3 comments

Wrong outage.
Was it like... a hardware failure? If you serve more than 100 people you probably should have redundant routers. Or was it, I hope, at least a configuration issue that replicated over to multiple devices?
Have you worked with redundant routers? They certainly reduce the number of outages, but sometimes the hardware (or software) fails in exciting ways that don't engage the redundancy, or don't engage it properly, and you still get an outage (or you get an outage that wouldn't have happened otherwise). Or sometimes one circuit is out of service for repair or upgrade, and the other circuit is connected to the router that failed. And routing for the AS that travels on that circuit was set not to fall back to transit because the last time that happened, it caused major issues.

I have no specific knowledge of today's events, but this sort of thing happens. You can get the number of incidents down pretty low, but not to zero.

I remember seeing a Security Analyst for the DC I worked for take down 6 racks worth of Cisco Catalyst 12000 series router hardware once.

They had an HSRP interface set up at the .1 address, and the security analyst set his laptop up with the same static .1 IP address and plugged it in. Instant outage.
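
For anyone who hasn't seen this failure mode: the problem is two devices claiming the same IP, so hosts on the segment can end up sending gateway traffic to the laptop. Here is a minimal sketch of how you might flag that from a list of (IP, MAC) pairs, e.g. scraped from ARP tables. The function name, the addresses, and the laptop MAC are all made up for illustration; the only real-world detail is the 00:00:0c:07:ac:XX pattern that HSRP version 1 uses for its virtual MAC. This is not what that DC ran, just a way to show the conflict.

```python
from collections import defaultdict
import ipaddress


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `observations` is an iterable of (ip_string, mac_string) pairs.
    This helper is hypothetical and only illustrates the idea.
    """
    claims = defaultdict(set)
    for ip, mac in observations:
        # Normalise both sides so "10.0.0.1" and mixed-case MACs compare cleanly.
        claims[ipaddress.ip_address(ip)].add(mac.lower())
    return {str(ip): sorted(macs) for ip, macs in claims.items() if len(macs) > 1}


# Hypothetical observations: the gateway's virtual MAC and a laptop both claim .1
observations = [
    ("10.0.0.1", "00:00:0c:07:ac:01"),   # HSRP v1 virtual MAC, group 1
    ("10.0.0.1", "3C:22:FB:AA:BB:CC"),   # made-up laptop MAC with the same static IP
    ("10.0.0.20", "3c:22:fb:dd:ee:ff"),  # unrelated host, no conflict
]

conflicts = find_ip_conflicts(observations)
for ip, macs in conflicts.items():
    print(f"conflict on {ip}: {', '.join(macs)}")
```

Detecting it after the fact is the easy part. Stopping an arbitrary laptop from answering for the gateway address in the first place is a switch-level problem, which this sketch doesn't touch.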

I have. I am just highlighting that the problem surely should be more complex than described. Or that their redundancy for these events was not adequately devised.
Google often releases a pretty solid post-mortem, which will give the details of the event. The level of detail appropriate for same-day release is really 'router failure' or 'power failure' or 'software failure' or 'vehicle drove into the building failure'. Expecting more than 'we know what it was, and we fixed it' or 'we don't know what it was, but it stopped happening' or 'yes, we're working on it' in a same-day Twitter post is unreasonable.
yes, because OBVIOUSLY Google is too stupid to know about redundant routers. /s

https://twitter.com/uhoelzle/status/1243259280410554368

"When routers fail cleanly (say, power out) failover is quick, so you never hear about these. This wasn't such a simple case. We have "many" (not just two) routers in Atlanta so it wasn't an issue of missing redundancy."

Networks are harder than everyone thinks. The 2018 CenturyLink outage on the west coast was caused by 1 bad network card that started writing malformed packets.

https://www.geekwire.com/2018/report-huge-centurylink-outage...

It's not that simple, as you sometimes need to manually isolate the faulty hardware and remove it from service.
Surely the "100 people" metric is too low, although I agree that at some point (and certainly at Google scale) a redundant router makes sense.