by y0y 2609 days ago
DNS is also inherently distributed. This should make it resilient to all of the most common outage scenarios, and is likely why AWS offers a 100% uptime SLA for Route 53.

I'll be interested in the post-mortem from Azure on this one.

2 comments

> likely why AWS offers a 100% uptime SLA for Route 53

Well, that's interesting. We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist. (We've not got a reproducible case for this yet, and it's incredibly rare for any given VM/service. But across our fleet, it crops up fairly regularly.)

I worked on Route 53 for a few years. I can't speak to your specific issue; too much depends on your clients, your networks, and your resolvers. But at a minimum, turn on query logging. You should get a timestamp, qname, and rtype to identify NXDOMAIN responses.
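
For what it's worth, here is a rough sketch of pulling the NXDOMAIN responses out of query logs. It assumes the space-separated line layout I remember from the Route 53 query log docs (version, timestamp, hosted zone ID, qname, rtype, response code, ...), so check the field order against your own logs first. The function names and the sample lines are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class QueryLogEntry(NamedTuple):
    timestamp: str
    qname: str
    rtype: str
    rcode: str


def parse_query_log_line(line: str) -> Optional[QueryLogEntry]:
    """Parse one log line; return None if it has too few fields.

    Assumed field order (verify against your logs):
    version, timestamp, zone ID, qname, rtype, rcode, ...
    """
    fields = line.split()
    if len(fields) < 6:
        return None
    return QueryLogEntry(fields[1], fields[3], fields[4], fields[5])


def nxdomain_entries(lines: Iterable[str]) -> List[QueryLogEntry]:
    """Keep only the entries whose response code is NXDOMAIN."""
    parsed = (parse_query_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry.rcode == "NXDOMAIN"]


if __name__ == "__main__":
    # Hypothetical sample lines, not real log data.
    sample = [
        "1.0 2017-12-13T08:16:02.130Z Z123412341234 example.com A NOERROR UDP FRA6 192.0.2.1 -",
        "1.0 2017-12-13T08:16:05.744Z Z123412341234 missing.example.com A NXDOMAIN UDP FRA6 192.0.2.1 -",
    ]
    for entry in nxdomain_entries(sample):
        print(entry.timestamp, entry.qname, entry.rtype)
```

If the authoritative side never logged an NXDOMAIN for the name at the failure time, that points at something between your client and Route 53 rather than at Route 53 itself.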

That said, the most common cause of authoritative NXDOMAIN is adding or deleting records and querying them before propagation is complete. You may want to log/poll your rrset change status separately to correlate.
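
A minimal sketch of that polling, assuming `client` is a boto3 Route 53 client (or anything with the same `get_change(Id=...)` call) and that the change ID came back from your earlier record change. The function name, parameters, and log format are my own, not anything official.

```python
import time


def wait_for_insync(client, change_id, poll_seconds=5.0,
                    timeout_seconds=300.0, log=print):
    """Poll a Route 53 change until it reports INSYNC or the timeout expires.

    Logs one timestamped line per poll so the PENDING window can be
    correlated with NXDOMAIN responses seen elsewhere.
    Returns True once the change is INSYNC, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_change(Id=change_id)["ChangeInfo"]["Status"]
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        log(f"{stamp} {change_id} {status}")
        if status == "INSYNC":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

Usage would be something like `wait_for_insync(boto3.client("route53"), change_id)`. Any lookup of the affected name made while the status is still PENDING is a candidate for the kind of NXDOMAIN described above.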

The other is that, depending on the network, intermediate DNS tampering happens all the time. Qname, rname, rtype, all get modified. Responses and queries are duplicated, intercepted, and manipulated. There's some good research on this out of DNS-OARC and from a dude out of Australia (iirc).

> We occasionally see getaddrinfo() calls fail claiming domains that we know exist at the failure time (b/c the records are completely static) don't exist.

That could be whatever resolvers you're hitting failing rather than an issue with Route 53 authoritative nameservers, though. The resolving DNS servers in EC2 are not actually part of Route 53, for example.

I'd think that would correspond to EAI_AGAIN or EAI_FAIL, whereas I'm pretty sure we're getting an EAI_NONAME.
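
To stop being only "pretty sure", something like this sketch could log which code actually comes back. It uses Python's `socket.getaddrinfo` as a stand-in for whatever your services call; the helper names and labels are made up, and how resolver failures map onto these codes depends on the libc and resolver setup.

```python
import socket

# Labels for the three error codes discussed in this thread.
EAI_LABELS = {
    socket.EAI_NONAME: "EAI_NONAME (name does not exist)",
    socket.EAI_AGAIN: "EAI_AGAIN (temporary failure, retry may work)",
    socket.EAI_FAIL: "EAI_FAIL (non-recoverable failure)",
}


def classify_gaierror(exc):
    """Return a readable label for a socket.gaierror."""
    return EAI_LABELS.get(exc.errno, f"other ({exc.errno})")


def resolve(host, port=443):
    """Resolve host; return (label, sorted list of addresses)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return classify_gaierror(exc), []
    # Each entry's sockaddr tuple starts with the address string.
    return "ok", sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    label, addresses = resolve("example.com")
    print(label, addresses)
```

Logging the label with a timestamp on every failure would at least separate "resolver said the name doesn't exist" from "resolver didn't answer".
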
We’ve experienced the same thing. I’ve never been able to figure it out. If you ever do, please let me know! I’ll owe you a beer ;)
You may be hitting EC2 DNS rate limits.
I would expect EAI_FAIL or EAI_AGAIN, but I'm pretty sure we're getting EAI_NONAME.

But the stuff that hits this problem most often is of a quality level where I wouldn't find that terribly surprising. It seems AWS "documents" this as:

> The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use.

How specific.

Do they typically provide a postmortem?
It's Microsoft. I'm sure they just rebooted it!

(I had to, see username!)

edit: seriously, -3? It was a joke.