Hacker News new | ask | show | jobs
by Kalium 3533 days ago
Imagine migrating your website to a new host. A month later, you learn that a major ISP has decided that its customers don't need to know about the move, because they hold on to last-known-good records as they like. So half your traffic and business is gone. Or maybe you can't use anything run on Heroku, because the dynamicism there doesn't play nice with your resolver's policies.

That's the kind of world we used to live in when TTLs were often treated as vague suggestions.

1 comments

The scenario I was describing was one where a last-known-good resolution would be used if and only if a refresh attempt fails after the authority-provided TTL expires.

I believe the scenario you are describing is a rogue ISP ignoring that authoritative TTL wholesale, caching resolutions according to its own preferences regardless of whether the authority is able to provide a response after the authoritative TTL expires.

The rogue ISPs thought they were helping people by serving stale data. After all, better something past its use-by date than failing, right? A low tolerance for DNS response times, and suddenly large chunks of the internet are failing a lot...

Among other problems, this enables attacks. Leak a route, DDoS a DNS provider, and watch as traffic everywhere goes to an attack server because servers everywhere "protect" people by serving known-stale data rather than failing safe.

Be very, very careful when trying to be "safer". It can unintentionally lead somewhere very different.

> A low tolerance for DNS response times, and suddenly large chunks of the internet are failing a lot...

Hang on a second. I feel that you're piling on other resolver changes in order to make a point. I'm not suggesting that the tolerance for DNS response times be reduced. Nor am I suggesting a scenario where the authority gets one shot after their TTL, after which they're considered dead forever. I would expect my caching DNS resolver to periodically re-attempt to resolve with the authority once we've entered the period after the authority's TTL.

> Leak a route, DDoS a DNS provider, and watch as traffic everywhere goes to an attack server because servers everywhere "protect" people by serving known-stale data rather than failing safe.

I think you're suggesting that someone could commandeer an IP and then prevent the rightful owner to correct their DNS to point to a temporary new IP.

Isn't the real problem in this scenario the ability to commandeer an IP? The malicious actor would also need to be able to provide a valid certificate at the commandeered IP. And at that point, I feel we've got a problem way beyond DNS resolution caching. Besides, if what you have proposed is possible, isn't it also possible against any current domain for the duration of their authoritative TTL? That is, a domain that specifies an 8-hour TTL is vulnerable to exactly this kind of scenario for up to an 8-hour window. Has this IP commandeering and certificate counterfeiting happened before?

> Hang on a second. I feel that you're piling on other resolver changes in order to make a point.

Yes. The point I am making is the additional failure modes that need to be considered and the pain they can cause. Historically have caused.

At no point did I ever think you were suggesting that one failure to respond renders a server dead to your resolver forever. Instead, I expect that your resolver will see a failure to respond from a resolver a high percentage of the time, leading to frequent serving of stale data.

> Isn't the real problem in this scenario the ability to commandeer an IP?

You're absolutely right! The real problem here is the ability to commandeer an IP.

However, that the real problem is in another castle does not excuse technical design decisions that compound the real problem and increase the damage potential.

> Instead, I expect that your resolver will see a failure to respond from a resolver a high percentage of the time, leading to frequent serving of stale data.

If this were true, the current failure mode would have end users receiving NX DOMAIN a "high percentage of the time," which obviously is not happening.

{edit: To be clear, I'm reading the quote as you stating that "failure to resolve" currently happens a high percentage of the time, and therefore this new logic would result in extended TTLs more often than the original post would assume they would happen}

> However, that the real problem is in another castle does not excuse technical design decisions that compound the real problem and increase the damage potential.

It's fair to point out that this change, combined with other known issues could create a "perfect storm," but as was pointed out this exploit is already possible within the current authoritative TTL window. Exploiting the additional caching rules would just be a method of extending that TTL window.

On the other hand, where do you draw the line here? If you had to make sure that no exploits were possible most of the systems that exist today would never have gotten off the ground. It seems a bit like complaining that the locks to the White House can be exploited (picked), while missing the fact that they are only supposed to slow someone down before the "men with guns" can react.

Based on the highly unscientific sample of the set of questions asked by my coworkers in my office today, the failure mode of end users receiving NX DOMAIN has happened much more than on most days.

I don't need to make sure no exploits are possible. However, it at all possible, I'd like to help ensure that things aren't accidentally made more dangerous. It's one thing to consider and make a tradeoff. It's quite another to be ignorant of what the price is.

WRT the second attack, what they're referring to is actually DNS cache poisoning - inserting a false record into the DNS pointing your name at an attacker-controlled IP address. This is a fairly common attack, but usually has an upper time limit - the TTL (which is often limited by DNS servers).

This proposal would allow an attacker to prolong the effects of cache poisoning by running a simultaneous DDoS against un-poisoned upstream DNS servers.

Not sure whether it could be used in a legitimate attack (probably), but it can definitely lead to confusing behavior in some scenarios. You switch servers, your old IP is handed to some random person, your website temporarily goes down - and now your visitors end up at some random website. Would you want that? Especially if you're a business?

Also, "commandeering" an IP of a small hosting might be easier than you think. It depends entirely on how they recycle addresses.

You seem to be continuing to warn against a proposal that isn't the one that was made. What specifically is dangerous about using cached records only in the case of the upstream servers failing to reply?
It doesn't take much of an imagination to attack this.

The older I get in tech the more I realize we just go in circles re-implementing every bad idea over again for the same exact reasons each "generation". Ah well.

TTL is TTL for a reason. It's simple. The publisher is in control, they set their TTL for 60 seconds so obviously they have robust DNS infrastructure they are confident in. They are also signaling with such low TTLs that they require them technically in order to do things like load balance or HA or need them for a DR plan.

Now I get a timeout. Or a negative response. What is the appropriate thing to do? Serve the last record I had? Are you sure? Maybe by doing so I'm actually redirecting traffic they are trying to drain and have now increased traffic at a specific point that is actually contributing to the problem vs. helping. How many queries do I get to serve out of my "best guess" cache before I ask again? How many minutes? Obviously a busy resolver (millions of qps at many ISPs) can't be checking every request so where do you draw the line?

It's just arrogant I suppose. The publisher of that DNS record could set a 30 day TTL if they wanted to, and completely avoid this. But they didn't, and they usually have a reason for that which should be respected. We have standards for a reason.

Assume we serve the last known record after TTL.

Here's the attack:

- Compromise IP (maybe facebook.com)

- DDoS nameservers

- facebook removes IP from rotation

- Users still connect to bad actor even though TTL expired

"We have standards for a reason" is absolutely correct, and we can't start ignoring the standards because someone can't imagine why we need them _at this moment_

Yes, but there's one piece missing.

> Here's the attack:

> - Compromise IP (maybe facebook.com)

- Attacker generates or acquires counterfeit facebook.com certificate.

> - DDoS nameservers

> - facebook removes IP from rotation

> - Users still connect to bad actor even though TTL expired

I understand what you are saying, but this attack scenario is extraordinarily difficult as a means to attack users who have opted to configure their local DNS resolver to retain a last-known-good IP resolution. It involves commandeering an IP and counterfeiting Facebook's SSL/TLS certificate. As I have said elsewhere in this thread, all sites are currently vulnerable to such an attack today for the duration of their TTL window. So if this is a plausible attack vector, we could plausibly see it used now.

The policy that caused so much pain before is to take DNS records, ignore their TTLs, and apply some other arbitrarily selected policy instead. I confess, I don't understand how the proposal at hand is different in ways that prevent the previous pains from recurring.

Maybe you can enlighten me on key differences I've overlooked? How do you define "failing to reply"? Do you ever stop serving records for being stale, or do you store them indefinitely?

So if our resolver was on our resolver was on our laptop and had a nice UI that would work great. Now the question is : why is the resolver not in my laptop?
It can be if you want it to be, but it's probably much less interesting than you think.

You likely underestimate the sheer number of DNS records you look up just by surfing the web, and how useful that information would be to 99.99% of users.

Basically the tools exist for you to do this yourself if you are so inclined, but they may not be that user friendly since they aren't generally useful to most.

Seconded. This is a common idea which occurs to people who haven't dealt with DNS before, and it ends with a much better understanding of how many things https is used for and going back to using openDNS as your resolver.