by singhsanjay12
107 days ago
Nice. Running Unbound locally with intelligent upstream selection and caching definitely reduces the blast radius of edge resolver outages. I haven't tried Unbound, but I'm curious: how do you handle recovery when the failure isn't just recursive resolver unavailability, but scenarios like stale IPs after control plane failover, long-lived gRPC connections that never re-resolve, or bootstrap loops where the system that needs to reconfigure DNS itself depends on DNS? In my experience, local recursive resolvers solve availability pretty well, but recovery semantics still depend heavily on client behavior and connection lifecycle management. Do you rely on aggressive re-resolution policies at the application layer? Or force connection churn after TTL expiry? Would love to understand how you think about resolver-level resilience vs application-level recovery.
We did not have to do this, but in that scenario I would have automation reach out to Unbound and drop the cache for that particular zone or subdomain. A script could then force a fetch of the new records for the zone to rebuild the cache.
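As a rough sketch of what that automation might look like: `unbound-control flush_zone` drops everything cached at or below a name, and a few ordinary lookups afterwards repopulate the cache. The function names and the zone and hostnames in the usage note are made up for illustration, and the sketch assumes the host resolves through the local Unbound and that `unbound-control` is set up for remote control.

```python
import socket
import subprocess


def flush_zone_command(zone, control="unbound-control"):
    """Build the command that drops every cached record at or below `zone`."""
    return [control, "flush_zone", zone]


def flush_and_warm(zone, hostnames, run=subprocess.run, resolve=socket.getaddrinfo):
    """Flush `zone` from the cache, then look up `hostnames` to repopulate it.

    Assumes this host resolves through the local Unbound instance, so each
    lookup after the flush is fetched from upstream and cached again.
    `run` and `resolve` are injectable so the logic can be tested offline.
    """
    run(flush_zone_command(zone), check=True)
    warmed = {}
    for name in hostnames:
        try:
            infos = resolve(name, None)
            # info[4] is the socket address; its first element is the IP.
            warmed[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            warmed[name] = []  # still failing; leave it to monitoring
    return warmed
```

Something like `flush_and_warm("example.internal", ["api.example.internal"])` would be triggered by whatever detects the control plane failover; both names here are hypothetical.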
> Or force connection churn after TTL expiry?
The TTL can be kept low and Unbound told to hold the last known IP after resolution, accepting that this breaks an RFC and that apps may hold onto the wrong IP for too long; Unbound will then request the record from upstream again to get the new IP. There is no one right answer. Whoever is the architect for the environment in question would have to decide which methods they believe will be more resilient, and then test failure conditions during chaos testing. Any remaining gap in resilience should be covered by monitoring and automation when the bad behavior cannot be eliminated through app/infra configuration.
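The reply does not say which directives it has in mind. One possible reading, sketched here as a small helper that renders an `unbound.conf` fragment, is to cap TTLs with `cache-max-ttl` and keep answering from expired entries with `serve-expired` while the record is refreshed from upstream. The function name and the default values are illustrative assumptions, not recommendations.

```python
def render_unbound_fragment(max_ttl=30, stale_ttl=3600):
    """Return a `server:` fragment that caps TTLs and serves expired answers.

    The values are placeholders; pick them from your own failure testing.
    """
    options = [
        ("cache-max-ttl", max_ttl),        # cap how long any record is cached
        ("serve-expired", "yes"),          # answer from expired cache entries
        ("serve-expired-ttl", stale_ttl),  # stop serving them after this many seconds
        ("prefetch", "yes"),               # refresh popular records before they expire
    ]
    lines = ["server:"]
    lines += [f"    {key}: {value}" for key, value in options]
    return "\n".join(lines) + "\n"
```

The output could be written to a drop-in file that the main configuration includes. Whether serving expired data is acceptable depends on how badly the applications behave when handed an address that is no longer valid.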
> how you think about resolver-level resilience vs application-level recovery
Well, sadly, the people managing or architecting the infrastructure may not have any input into how the applications manage DNS. Ideally both groups would meet and discuss options if this is a greenfield deployment. If not, the second-best option would be to discuss the platform behavior with a subject matter expert, along with an operations manager who can summarize all the DNS failures, root cause analyses and restoration methods, to determine what behavior should be configured into the stack. Here again there is no one right answer. As a group they will have to decide at which layer DNS retries occur most aggressively and how much input automation will have at the app and infra layers.
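If the group decides the application layer should own part of the recovery, the check itself can be small. This is a minimal sketch, assuming the app knows which IP a long-lived connection is using and can call the check on a timer; the function names are hypothetical, and how the connection is actually torn down and rebuilt is left to the application.

```python
import socket


def current_addresses(host, resolve=socket.getaddrinfo):
    """Return the set of IP addresses `host` resolves to right now."""
    return {info[4][0] for info in resolve(host, None)}


def should_reconnect(host, connected_ip, resolve=socket.getaddrinfo):
    """True when the address a connection uses is no longer in DNS.

    A failed lookup returns False: losing the resolver is not evidence that
    the peer moved, so the existing connection is left alone.
    """
    try:
        return connected_ip not in current_addresses(host, resolve)
    except socket.gaierror:
        return False
```

Treating a lookup failure as "do nothing" is a deliberate choice here: it keeps a resolver outage from turning into a wave of dropped connections.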
The overall priority should be to ensure that past DNS issues, the known-knowns, are designed out of the system. That leaves only the unknown-unknowns to be dealt with reactively, possibly first with automation and then with an operations or SRE team.
Take a look through the Unbound configuration directives [2] to see some of the options available.
[2] - https://nlnetlabs.nl/documentation/unbound/unbound.conf/