by Bender
108 days ago
For me, and at the place I retired from, the optimal solution was an instance of Unbound [1] on every node: keeping a local cache, retrying edge resolvers intelligently, preferring the fastest-responding edge resolvers, capping min-TTL for both resource records and infrastructure, pre-caching, etc. I've done that at home, and when others talk about a DNS outage I have to go out of my way to see or replicate it, usually by forcing a flush of the cache. Most Linux distributions have a build of Unbound. I point edge DNS recursive resolvers to the root servers rather than leaking internal systems' requests to Cloudflare or Google. Unbound can also be configured to not forward internal names, or to point requests for internal names to specific upstream servers.
[1] - https://nlnetlabs.nl/projects/unbound/about/
I haven't tried Unbound, but I'm curious: how do you handle recovery when the failure isn't just recursive resolver unavailability, but scenarios like stale IPs after a control plane failover, long-lived gRPC connections that never re-resolve, or bootstrap loops where the system that needs to reconfigure DNS itself depends on DNS?
In my experience, local recursive resolvers solve availability pretty well, but recovery semantics still depend heavily on client behavior and connection lifecycle management.
Do you rely on aggressive re-resolution policies at the application layer? Or force connection churn after TTL expiry?
Would love to understand how you think about resolver-level resilience vs application-level recovery.
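To make the "force connection churn after TTL expiry" option concrete, here is a rough Python sketch of what I mean by application-level re-resolution. Everything in it is hypothetical: the `ReResolvingEndpoint` class, the `max_age` knob, and the host name are made up for illustration. `socket.getaddrinfo` does not expose record TTLs, so `max_age` is an assumed stand-in for "TTL expiry", not the real TTL. The client pins the address it connected to, re-resolves once the cached answer is older than `max_age`, and asks for a reconnect when the pinned address has dropped out of the answer.

```python
import socket
import time
from typing import Callable, List, Optional


def system_resolve(host: str, port: int) -> List[str]:
    """Resolve through the OS stub resolver (e.g. pointed at a local Unbound)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Keep the resolver's ordering, drop duplicate addresses.
    return list(dict.fromkeys(info[4][0] for info in infos))


class ReResolvingEndpoint:
    """Hypothetical helper: tracks which address a connection is pinned to."""

    def __init__(
        self,
        host: str,
        port: int,
        max_age: float = 30.0,  # assumed stand-in for TTL, in seconds
        resolve: Callable[[str, int], List[str]] = system_resolve,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.host = host
        self.port = port
        self.max_age = max_age
        self._resolve = resolve
        self._clock = clock
        self._addrs: List[str] = []
        self._resolved_at: Optional[float] = None
        self.pinned: Optional[str] = None

    def refresh(self) -> List[str]:
        """Re-resolve if the cached answer is older than max_age."""
        now = self._clock()
        if self._resolved_at is None or now - self._resolved_at >= self.max_age:
            try:
                self._addrs = self._resolve(self.host, self.port)
                self._resolved_at = now
            except OSError:
                # Resolver failure: keep the last known answer if we have one
                # and retry on the next call; otherwise surface the error.
                if not self._addrs:
                    raise
        return self._addrs

    def connect_target(self) -> str:
        """Pick the address to connect to and remember it as pinned."""
        self.pinned = self.refresh()[0]
        return self.pinned

    def should_recycle(self) -> bool:
        """True when the pinned address is no longer in the current answer."""
        if self.pinned is None:
            return True
        return self.pinned not in self.refresh()
```

The idea would be to call `should_recycle()` from whatever health check or idle timer the client already has, and to drain and reconnect via `connect_target()` when it returns true. This only covers the stale-IP case; it does nothing for the bootstrap loop, where the thing doing the resolving is itself unreachable.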