Hacker News
by can16358p 1716 days ago
Pardon me if it's a stupid question, but out of curiosity:

Is there any way to keep DNS up in case BGP goes down for any reason? For example, a fallback nameserver hosted elsewhere, outside Facebook's ASes?

Is it technically impossible or did Facebook just assume something like yesterday would never happen and kept things simple instead of complicating things?

3 comments

BGP didn't "go down" - they erroneously removed all routes between the Internet and several Facebook internal networks via BGP. BGP was the instrument of their destruction, but not the source. Someone or something told BGP to do that; whatever that was is the cause of the issue.

At least one of those networks they accidentally removed also happened to contain the DNS servers; DNS being unavailable was a symptom - but not part of the root problem. Any focus on DNS at this point is a red herring.

Think of routes as street directions - they tell routers where to ship packets. If you erase all your addresses and the directions to them from the outside world at large, then there literally is no way for network packets to get from the global Internet to Facebook's networks (where I imagine the DNS servers were up and probably twiddling their thumbs wondering where everyone went).

An easier way to think of it - they essentially took a pair of scissors and cut the cable connections to the Internet - which is why it was so catastrophic.
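To make the street-directions picture concrete, here is a minimal sketch of a toy routing table with longest-prefix matching. It is not how BGP or any real router is implemented; the `RoutingTable` class, the documentation prefixes and the AS number are all made up for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to a next-hop label."""

    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no directions: the packet has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
# Hypothetical prefixes (documentation ranges) and a hypothetical AS number.
table.announce("192.0.2.0/24", "AS64500")     # network holding the DNS servers
table.announce("198.51.100.0/24", "AS64500")  # network holding the web servers

before = table.lookup("192.0.2.53")  # a route exists

# The bad change: every prefix is withdrawn.
table.withdraw("192.0.2.0/24")
table.withdraw("198.51.100.0/24")

after = table.lookup("192.0.2.53")   # nothing matches any more
```

The server at 192.0.2.53 never stopped running in this model; the rest of the world just lost the directions to it.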

The only way to mitigate that is to have an identical infrastructure managed by different tooling, so a bad configuration setting from one environment wouldn't pollute the second in the same way. Not exactly an easy thing to do, and it might cause more problems than it's worth. And you would have to do that for all services, not just DNS. Let's say Facebook used Cloudflare for their DNS. Great - DNS can resolve your request for fb.com to the IP address of the Facebook datacenter - there still is no path for your packets to get to that Facebook datacenter because they accidentally purged the routes to their networks.
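A rough sketch of that Cloudflare scenario, assuming a request has just two stages (name lookup, then packet delivery). The `fetch` function, the hostname and the addresses are hypothetical and only model the order in which things fail.

```python
import ipaddress

# Hypothetical third-party DNS data, hosted outside the affected network.
EXTERNAL_DNS = {"www.example.com": "192.0.2.10"}


def fetch(name, dns_records, announced_prefixes):
    """Return how far a request gets: 'dns failure', 'no route' or 'connected'."""
    address = dns_records.get(name)
    if address is None:
        return "dns failure"
    addr = ipaddress.ip_address(address)
    # Delivery only works if some announced prefix covers the address.
    if not any(addr in ipaddress.ip_network(p) for p in announced_prefixes):
        return "no route"
    return "connected"


normal = fetch("www.example.com", EXTERNAL_DNS, ["192.0.2.0/24"])
outage_external_dns = fetch("www.example.com", EXTERNAL_DNS, [])  # routes gone, DNS survives
outage_internal_dns = fetch("www.example.com", {}, [])            # routes gone, DNS gone too
```

With external DNS the failure just moves one stage later: the name resolves, and then the packets have nowhere to go.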

It's easier to just not cut your connection to the Internet :) I'm sure there are all kinds of internal discussions picking this incident apart and formulating ways to either prevent it, or more realistically - have improved procedures to speed recovery when it inevitably happens again. BGP is not known for its inherent robustness or security. But since it's at the core of the Internet, any changes to it would have to be done on a massive internet-wide scale in perfect unison or the "cure" would be a lot worse than the current problems with it.

Murphy was indeed an optimist! (search "Murphy's Law" for those unfamiliar with the idiom)

Yes, but...

If it's a FB managed server, run on someone else's network, you still have a lot of the FB software risk (FB's software stack and development mantra make it easy to push changes, some of which break everything, including the ability to push further changes); even if not FB, there's a similar risk.

If it's not a FB managed server, like a 3rd party DNS provider, it's difficult to get that synchronized considering all the fun geographic loadbalancing FB is doing at the DNS level. That's generally hard once you start doing this; and it's why you don't see many dual-provider DNS setups.

Really, the status page should not be on a core domain, so that its DNS can just be external.

FB DNS breaking yesterday almost doesn't matter in the scheme of things, because the BGP breakage broke everything anyway. Would it have been a bit nicer to get HTTP error messages instead of DNS not found messages? Sure, but mostly nothing was working anyway.

It’s definitely technically possible to have secondaries on a separate network that do zone transfers (AXFR) from the primary. That’s not to imply it’s trivial / easy at FB’s scale (query volume) or topology complexity (as in GSLB).
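A toy model of the secondary idea, for illustration only. It does not speak the DNS protocol: `Zone`, `Secondary` and the sample record are made up, the whole-zone copy stands in for AXFR, and the serial check is a plain integer comparison rather than real SOA serial arithmetic.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    serial: int                                   # stands in for the SOA serial
    records: dict = field(default_factory=dict)   # name -> address


class Secondary:
    """Keeps a full copy of the zone and refreshes it from the primary."""

    def __init__(self):
        self.zone = None

    def refresh(self, primary_zone):
        """Copy the whole zone if the primary is reachable and newer.

        primary_zone is None when the primary cannot be reached.
        Returns True if a transfer happened.
        """
        if primary_zone is None:
            return False  # keep serving the last good copy
        if self.zone is None or primary_zone.serial > self.zone.serial:
            self.zone = Zone(primary_zone.serial, dict(primary_zone.records))
            return True
        return False

    def resolve(self, name):
        return None if self.zone is None else self.zone.records.get(name)


primary = Zone(serial=2021100401, records={"status.example.com": "203.0.113.5"})
secondary = Secondary()

first_sync = secondary.refresh(primary)          # initial transfer
second_sync = secondary.refresh(primary)         # same serial, nothing to do
during_outage = secondary.refresh(None)          # primary unreachable
answer = secondary.resolve("status.example.com")  # still answered from the copy
```

Two caveats: a real secondary stops answering once the zone's SOA expire timer runs out without a successful refresh, and a static copy like this does nothing for answers that vary per client, which is the GSLB problem mentioned above.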