by EricE 1715 days ago
BGP didn't "go down" - they erroneously removed all routes between the Internet and several Facebook internal networks via BGP. BGP was the instrument of their destruction, but not the source. Someone or something told BGP to do that; whatever that was is the cause of the issue.

At least one of those networks they accidentally removed also happened to contain the DNS servers; DNS being unavailable was a symptom - but not part of the root problem. Any focus on DNS at this point is a red herring.

Think of routes as street directions - they tell routers where to ship packets. If you erase all your addresses, and the directions to them, from the outside world at large, then there literally is no way for network packets to get from the global Internet to Facebook's networks (where I imagine the DNS servers were up and probably twiddling their thumbs wondering where everyone went).
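
To make the street-directions analogy concrete, here's a toy sketch of a routing table, assuming a plain longest-prefix-match lookup. This is not real BGP (no sessions, no path attributes, no policy), and the `RoutingTable` class, the prefixes (reserved documentation ranges) and the next-hop names are all made up for illustration:

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next hop."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Removing a prefix that was never announced is a no-op.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no directions at all: the packet gets dropped
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-a")       # hypothetical "DNS servers" network
table.announce("198.51.100.0/24", "peer-b")    # hypothetical "web servers" network

before = table.lookup("192.0.2.53")            # routable while the prefix is announced

# The bad change: every prefix gets withdrawn.
for prefix in list(table.routes):
    table.withdraw(str(prefix))

after = table.lookup("192.0.2.53")             # nothing left to match
print(before, after)
```

The server at 192.0.2.53 never changed state in this sketch; only the directions to it disappeared, which is the point of the paragraph above.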

An easier way to think of it - they essentially took a pair of scissors and cut the cable connections to the Internet - which is why it was so catastrophic.

The only way to mitigate that is to have an identical infrastructure managed by different tooling so a bad configuration setting from one environment wouldn't pollute the second in the same way. Not exactly an easy thing to do, and it might cause more problems than it's worth. And you would have to do that for all services, not just DNS. Let's say Facebook used Cloudflare for their DNS. Great - DNS can resolve your request for fb.com to the IP address of the Facebook datacenter - there still is no path for your packets to get to that Facebook datacenter because they accidentally purged the routes to their networks.
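
A rough sketch of that last scenario, assuming DNS is hosted somewhere that stays reachable while the routes to the datacenter are gone. Everything here is hypothetical: `example.com` and 192.0.2.10 stand in for the real name and address, and `connect` only models the two steps, it doesn't touch the network:

```python
import ipaddress

# Hypothetical externally hosted DNS zone: it keeps answering during the outage.
EXTERNAL_DNS = {"example.com": "192.0.2.10"}


def resolve(name):
    """Step 1: name -> IP address, or None if the name is unknown."""
    return EXTERNAL_DNS.get(name)


def has_route(announced_prefixes, address):
    """Step 2: is the address covered by any prefix still announced?"""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(p) for p in announced_prefixes)


def connect(name, announced_prefixes):
    ip = resolve(name)
    if ip is None:
        return "dns failure"
    if not has_route(announced_prefixes, ip):
        return "no route to host"
    return "connected"


normal = connect("example.com", {"192.0.2.0/24"})  # prefix announced
outage = connect("example.com", set())             # all prefixes withdrawn
print(normal, "/", outage)
```

In the outage case the lookup succeeds and the connection still fails, so moving DNS elsewhere only changes which error the user sees.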

It's easier to just not cut your connection to the Internet :) I'm sure there are all kinds of internal discussions picking this incident apart and formulating ways to either prevent it, or more realistically - have improved procedures to speed recovery when it inevitably happens again. BGP is not known for its inherent robustness or security. But since it's at the core of the Internet, any changes to it would have to be done on a massive internet-wide scale in perfect unison or the "cure" would be a lot worse than the current problems with it.

Murphy was indeed an optimist! (search "Murphy's Law" for those unfamiliar with the idiom)