Hacker News
by kurtextrem 1464 days ago
Yet another BGP-caused outage. At some point we should collect all of them:

- Cloudflare 2022 (this one)

- Facebook 2021: https://news.ycombinator.com/item?id=28752131 - this one probably had the single biggest impact, since engineers got locked out of their systems, which made the fixing part look like a sci-fi movie

- (Indirectly caused by BGP: Cloudflare 2020: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)

- Google Cloud 2020: https://www.theregister.com/2020/12/16/google_europe_outage/

- IBM Cloud 2020: https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...

- Cloudflare 2019: https://news.ycombinator.com/item?id=20262214

- Amazon 2018: https://www.techtarget.com/searchsecurity/news/252439945/BGP...

- AWS: https://www.thousandeyes.com/blog/route-leak-causes-amazon-a... (2015)

- YouTube: https://www.infoworld.com/article/2648947/youtube-outage-und... (2008)

And then there are incidents caused by hijacking: https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...

7 comments

Came here to say exactly this... things that mess with BGP have the power to wipe you off the internet.

Some more:

- Google 2016, configuration management bug/BGP: https://status.cloud.google.com/incident/compute/16007

- Valve 2015: https://www.thousandeyes.com/blog/steam-outage-monitor-data-...

- Cloudflare 2013: https://blog.cloudflare.com/todays-outage-post-mortem-82515/

> since engineers got locked out of their systems

Sounds like the same happened here:

"Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations."

But Cloudflare had sufficient backup connectivity to fix it. I'm curious how Cloudflare does that today; the solution long ago was always a modem on an auxiliary port.

Worst case, if I were designing this, I would probably have a satellite connection running over Iridium at each of their biggest DCs.

Also, let's face it: the utility of a trusted security guard/staff with an old-fashioned physical key is pretty hard to screw up!

Not sure how common it is, but you can get serial OOBM devices accessible over cellular which would then give you access to your equipment.

I'm surprised more places don't implement a "click here to confirm changes or it'll be rolled back in 5 minutes" prompt, like all those monitor settings dialogs.
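For what it's worth, the confirm-or-roll-back idea is small enough to sketch. This is only an illustration: `ConfirmedChange`, `apply`, `rollback`, and `timeout_s` are names made up here, and a real router would do this in its own config system rather than in Python. The point is that the confirmation has to arrive over the same path the change might break, so a change that cuts you off undoes itself.

```python
import threading


class ConfirmedChange:
    """Apply a change now; undo it unless confirm() is called before the timeout."""

    def __init__(self, apply, rollback, timeout_s):
        self._rollback = rollback
        self._lock = threading.Lock()
        self._done = False
        self.rolled_back = False
        apply()
        # Hypothetical safety net: fires once if nobody confirms in time.
        self._timer = threading.Timer(timeout_s, self._expire)
        self._timer.daemon = True
        self._timer.start()

    def _expire(self):
        with self._lock:
            if self._done:
                return
            self._done = True
            self._rollback()
            self.rolled_back = True

    def confirm(self):
        """Return True if the change is kept, False if it was already rolled back."""
        with self._lock:
            if self._done:
                return not self.rolled_back
            self._done = True
        self._timer.cancel()
        return True


# Usage: the dict stands in for device configuration (an assumption for the demo).
config = {"policy": "old"}
change = ConfirmedChange(
    apply=lambda: config.update(policy="new"),
    rollback=lambda: config.update(policy="old"),
    timeout_s=300,
)
kept = change.confirm()  # the operator could still reach the box and confirmed
```

If `confirm()` is never called, the timer restores the old policy after `timeout_s` seconds.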

They have their machines also connected to another AS, so when their network doesn't/can't route, they can still get to their machines to fix stuff.

> the solution long ago was always a modem on an auxiliary port

Now you can use mobile Internet (4G/5G)

Cell coverage inside datacenters isn't always suitable, occasionally even by design.

You say that like it hasn't been going on since the mid-1990s, when it got deployed.

I'm not blaming BGP, since it prevents far more outages than it causes, but BGP-based outages have been a thing since its beginning. And any other protocol would have outages too - BGP just happens to be the protocol being used.

These are the public-facing BGP announcements that cause problems, but that doesn't account for the ones on private LANs that also happen. Previous employers of mine have had significant internal network issues because internal BGP between sites started causing problems. I'm not sure there's anything better (I am not a network guy), but this list can't be exhaustive.

The internet runs on BGP, so I would think that most internet issues would be a result of BGP.

There are lots of other causes of incidents, like cut cables, failed router hardware, data centers losing power, etc.

It just seems that most of these are local enough and the Internet resilient enough that they don't cause global issues. Maybe the exception would be AWS us-east-1 outages :-)

Maybe it's a testament to BGP's effectiveness that so many large-scale outages are due to misconfiguring BGP rather than the frequent cable cuts and hardware failures that BGP routes around.

BGP is the reason you don't hear about cable cuts taking down the internet.

That's like blaming the hammer for breaking.

BGP is just a tool; without it, something else would serve the same purpose.

Some tools are more fragile and error-prone than others.

Except that this wasn't an example of BGP being prone to error or fragile. This was, as the blog post specifically calls out, human error. They put two BGP announcement rules after the "deny everything not previously allowed" rule. It's the same as if someone did this to a set of ACLs on a firewall.

The main difference between BGP and all other tools is that if you mess up BGP, you've done a very visible thing because BGP underpins how we get to each other's networks. But it's not a sign of BGP being fragile, just very important.
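To make the ordering problem concrete, here is a toy first-match evaluator. It is not any vendor's policy syntax and not Cloudflare's actual configuration; the term names and the documentation prefixes are made up. It only shows why a term placed after a match-everything reject can never fire.

```python
from ipaddress import ip_network


def evaluate(terms, prefix):
    """Return (term name, action) for the first matching term; default is reject."""
    net = ip_network(prefix)
    for name, match, action in terms:
        # match=None is the hypothetical "everything not previously allowed" term.
        if match is None or net.subnet_of(ip_network(match)):
            return name, action
    return "implicit-default", "reject"


# Each term is (name, prefix to match or None for match-all, action).
broken = [
    ("allow-site-local", "192.0.2.0/24", "accept"),
    ("reject-the-rest", None, "reject"),
    ("advertise-new", "198.51.100.0/24", "accept"),  # placed after the catch-all
]
fixed = [broken[0], broken[2], broken[1]]  # same terms, catch-all moved last

print(evaluate(broken, "198.51.100.0/24"))  # ('reject-the-rest', 'reject')
print(evaluate(fixed, "198.51.100.0/24"))   # ('advertise-new', 'accept')
```

Same three terms in both lists; only the order differs, which is exactly the firewall ACL analogy.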

That does seem like bad UX/"DevX": that configuration of rules is syntactically "valid", and there weren't better equivalents of "linters"/"compilers" flagging it before it ever got sent out as an announcement. UX issues are a "proneness" to error/fragility. It sounds like there is room to build a "higher-level language" (like a "TypeScript : JavaScript :: ? : BGP") for BGP announcements that is less prone to "accidentally bad programs". Not that I have immediate suggestions; my gut reaction from skimming these sorts of outage reports is just that if this were a "language" I was writing in, I'd want a lot more (type) safety nets.
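As a rough idea of what the simplest such "linter" check might look like, here is a sketch that flags terms which can never match because an earlier term already covers them. The term format (name, prefix or `None` for match-all, action) and the function name `find_shadowed` are invented for illustration. It only catches a term fully covered by one earlier term, not by a combination of several.

```python
from ipaddress import ip_network


def find_shadowed(terms):
    """Return (shadowed term, shadowing term) name pairs for unreachable terms."""
    findings = []
    for i, (name, match, _action) in enumerate(terms):
        for earlier_name, earlier_match, _ in terms[:i]:
            # An earlier match-all term, or an earlier prefix containing this
            # term's prefix, wins every time under first-match evaluation.
            covers = earlier_match is None or (
                match is not None
                and ip_network(match).subnet_of(ip_network(earlier_match))
            )
            if covers:
                findings.append((name, earlier_name))
                break
    return findings


# Hypothetical policy with an announcement term after the catch-all reject.
policy = [
    ("allow-site-local", "192.0.2.0/24", "accept"),
    ("reject-the-rest", None, "reject"),
    ("advertise-new", "198.51.100.0/24", "accept"),
]
print(find_shadowed(policy))  # [('advertise-new', 'reject-the-rest')]
```

A check like this could run before a change is pushed, refusing any policy where the list of findings is not empty.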

Some tools are more prone to human error than others.

Another canonical example is C++. Some tools make it easy to blow your leg off. Some tools provide safety mechanisms to stop the saw from cutting off your finger.