by sk5t 1557 days ago
Yep. I had an (immutable) JVM ECS service that worked fine for a year, then started failing DNS lookups in just one of several AWS AZs, and for just one of various host names--one with something like 24 A records, which is more than a few but not an outrageous number. Occasionally forking a process to run 'dig' on the same name made it work for a little while.
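For a sense of scale on "24 A records", here is a back-of-the-envelope sketch of how large such a response would be compared with the classic 512-byte limit for DNS over UDP without EDNS0. This is not a claim about the actual cause. The host name is hypothetical, and the estimate assumes name compression, no CNAME chain, and empty authority and additional sections.

```python
UDP_LIMIT = 512  # classic DNS-over-UDP message limit without EDNS0 (RFC 1035)


def encoded_name_len(name):
    """Bytes needed for a name on the wire: length byte + label bytes, then a zero byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def estimate_a_response_size(name, a_records):
    """Rough size of a response carrying only A records for `name`.

    Assumes every answer uses a 2-byte compression pointer for its owner name.
    """
    header = 12
    question = encoded_name_len(name) + 4  # QTYPE + QCLASS
    # pointer(2) + TYPE(2) + CLASS(2) + TTL(4) + RDLENGTH(2) + IPv4 address(4)
    per_answer = 16
    return header + question + a_records * per_answer


if __name__ == "__main__":
    host = "svc.example.internal"  # hypothetical name, not from the original post
    for count in (24, 30):
        size = estimate_a_response_size(host, count)
        print(count, "A records:", size, "bytes, over limit:", size > UDP_LIMIT)
```

Under those assumptions 24 records fit under 512 bytes, but without much headroom, so any extra data in the response from one set of servers could plausibly push it over.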

AWS support's only advice was "don't use Alpine"; annoyingly, switching the containers to a Debian base cured it, even though this would appear to make absolutely no sense with respect to it failing in just one AZ.
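For anyone trying to reproduce something like this: 'dig' sends its own DNS queries rather than going through the libc resolver, so it can succeed while the application fails. Below is a minimal sketch of a probe that goes through getaddrinfo instead, which is the path an application normally takes. It assumes Python is available in the image; the function names and the injectable `resolver` parameter are my own additions for illustration, and this is not what I actually ran at the time.

```python
import socket
import time


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses from the system resolver."""
    infos = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def probe(hostname, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve repeatedly; record (True, address_count) or (False, error_text) per attempt."""
    results = []
    for _ in range(attempts):
        try:
            results.append((True, len(resolve_ipv4(hostname, resolver))))
        except socket.gaierror as exc:
            results.append((False, str(exc)))
        time.sleep(delay)
    return results


if __name__ == "__main__":
    import sys

    # Usage: python probe.py <hostname>
    for ok, detail in probe(sys.argv[1]):
        print("ok" if ok else "FAIL", detail)
```

Running the same probe from an Alpine-based and a Debian-based container in the affected AZ would show whether the failure follows the base image's resolver or the network.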


Failing in one AZ probably means they changed something with the DNS servers there. We had a similar issue recently when we rolled out a new DNS server that returned longer SOA records, which broke Python's MySQL driver in only some regions during the rollout. Debugging nightmare fuel.
Yes, my theory was also some DNS variance in the AZ, but AWS stolidly refused to supply any information to that effect. Adding 'dig +trace' to the equation only deepened the mystery.