by fauria 1716 days ago
Facebook, in this case, operates a set of intermediary DNS servers that are responsible for everything between your ISP's recursors and the roots. These are responsible for facebook.com, instagram.com, whatsapp.com and everything else they operate.

This is not the case for instagram.com, which is hosted on a different provider (AWS Route53) and was resolvable during the whole outage.

I'm not sure why Instagram's frontend servers returned 503, though. Maybe their backend fleet was included in the withdrawn prefixes, or maybe it was referenced through the affected domains.


"I'm not sure why Instagram's frontend servers returned 503, though."

One explanation is that Facebook uses a proxy configuration that relies on DNS to resolve the internal IP addresses of the backend servers. High-availability proxy servers like HAProxy can easily do lookups from files loaded into memory instead of making DNS requests. Apparently Facebook had no backup plan for the case where the DNS method started failing, so Facebook remained down until their DNS servers became available again. The proxies continued to work, and no doubt the backend servers were available the entire time, but the proxies could not connect to them because the DNS lookups for their internal IP addresses (serv)failed. After the retried DNS queries finally time out, a 503 is returned.
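To make the "backup plan" idea concrete, here is a rough Python sketch of a proxy-side lookup that tries DNS first and falls back to a table held in memory. It is only an illustration under my own assumptions, not how Facebook's or HAProxy's resolution actually works: the hostname, the IP address, and the helpers `resolve_backend`, `proxy_status` and `failing_dns` are all made up.

```python
import socket

# Hypothetical static table loaded into memory at startup (name and IP are made up).
STATIC_BACKENDS = {"backend.internal.example": "10.0.0.12"}


def failing_dns(name):
    """Stand-in resolver that simulates every DNS lookup failing."""
    raise socket.gaierror("simulated SERVFAIL for " + name)


def resolve_backend(name, dns_lookup=socket.gethostbyname, static=STATIC_BACKENDS):
    """Return (ip, source): try DNS first, then the in-memory table."""
    try:
        return dns_lookup(name), "dns"
    except OSError:
        # socket.gaierror and socket timeouts are both OSError subclasses.
        pass
    if name in static:
        return static[name], "static"
    return None, "unresolved"


def proxy_status(name, dns_lookup=socket.gethostbyname, static=STATIC_BACKENDS):
    """Status the proxy would return: 503 when no backend address is known."""
    ip, _source = resolve_backend(name, dns_lookup=dns_lookup, static=static)
    return 200 if ip is not None else 503
```

With `failing_dns` plugged in as the resolver, a name that is in the table still resolves through the static entry, while a name that is not in the table ends in a 503, which is the failure mode described above.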

"Maybe their backend fleet was included in the withdrawn prefixes..."

According to Cloudflare's writeup, the only prefixes withdrawn were for DNS servers.

Another possibility is that failing to announce the prefixes for their DNS server IPs was just a symptom of a larger problem, like misconfigured routers.

Kind of funny that instagram.com uses Route53, but amazon.com does not.