| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Bender 108 days ago

For me and also the place I retired from the optimal solutions was an instance of Unbound [1] on every node keeping local cache, retrying edge resolvers intelligently, preferring the fastest responding edge resolvers, cap on min-ttl or both resource records and infrastructure, pre-caching, etc... I've done that at home and when others talk about a DNS outage I have to go out of my way to see or replicate it usually by forcing a flush of the cache.

Most Linux distributions have a build of Unbound. I point edge DNS recursive resolvers to the root servers rather than leaking internal systems requests to Cloudflare or Google. Unbound can also be configured to not forward internal names or to point requests for internal names to specific upstream servers.

[1] - https://nlnetlabs.nl/projects/unbound/about/

1 comments

singhsanjay12 108 days ago

Nice. Running Unbound locally with intelligent upstream selection and caching definitely reduces blast radius from edge resolver outages.

I haven't tried Unbound but I’m curious though, how do you handle recovery behavior when the failure isn’t just recursive resolver unavailability, but scenarios like stale IPs after control plane failover, or long-lived gRPC connections that never re-resolve, or bootstrap loops where the system that needs to reconfigure DNS itself depends on DNS?

In my experience, local recursive resolvers solve availability pretty well, but recovery semantics still depend heavily on client behavior and connection lifecycle management.

Do you rely on aggressive re-resolution policies at the application layer? Or force connection churn after TTL expiry?

Would love to understand how you think about resolver-level resilience vs application-level recovery.

link

Bender 108 days ago

stale IPs after control plane failover

We did not have to do this but in that scenario I would have automation reach out to Unbound and drop the cache for that particular zone or sub-domain. A script could force fetching the new records for any given zone to rebuild the cache.

Or force connection churn after TTL expiry?

The TTL can be kept low and Unbound told to hold the last known IP after resolution accepting this breaks an RFC and the apps may hold onto the wrong IP for too long and then Unbound will request it from upstream again to get the new IP. There is no one right answer. Whomever is the architect for the environment in question would have to decide with methods they believe will be more resilient and then test failure conditions when they do chaos testing. Anywhere there is a gap in resilience should be part of monitoring and automation when the bad behavior can not be eliminated through app/infra configuration.

how you think about resolver-level resilience vs application-level recovery

Well sadly the people managing or architecting the infrastructure may not have any input into how the applications manage DNS. Ideally both groups would meet and discuss options if this is a greenfield deployment. If not then the second best option would be to discuss the platform behavior with a subject matter expert in addition to an operations manager that can summarize all the DNS failures, root cause analysis and restoration methods to determine what behavior should be configured into the stack. Here again there is no one right answer. As a group they will have to decide at which layer DNS retries occur most aggressively and how much input automation will have at the app and infra layers.

The overall priority should be to ensure that past DNS issues known-knowns are designed out of the system. That leaves only unknown-unknowns. to be dealt with in a reactive state, possibly first with automation and then with an operations or SRE team.

Take a look through the Unbound configuration directives [2] to see some of the options available.

[2] - https://nlnetlabs.nl/documentation/unbound/unbound.conf/

link

singhsanjay12 108 days ago

This matches what I've seen too.

Resolver-level resilience is often manageable centrally. The harder part is application-level recovery; especially in larger orgs where DNS behavior spans multiple teams.

Even with low TTLs or cache flush automation, apps may: resolve once at startup or hold long-lived gRPC/TCP connections, or even worse -> ignore TTL semantics entirely

So infra assumes "DNS healed," but the app never re-resolves.

link

Bender 108 days ago

or even worse -> ignore TTL semantics entirely

I sense you may have java in your environment and are probably used to

    export JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Dnetworkaddress.cache.ttl=0"

or something along that line including other options. At least I tried to get teams to use those and then rely on Unbound DNS cache and retry schemes. SystemD also has it's own resolver cache which can be disabled and told to use a local instance of Unbound. Windows servers require Group Policy and registry modifications to change their behavior.

One of my pet peeves is when groups do not manage domain/search correctly and they do not use FQDN in the application configuration resulting in 3x or 4x or more the number of DNS requests which also amplifies all DNS problems/outages. That really grinds my gears.

And of course if the Linux system uses glibc, editing /etc/gai.conf to prefer IPv4 or IPv6 depending on what is primarily used inside the data-center makes a big difference.

link