I really want to like Alpine, but we (Fly.io) have seen so many DNS issues with customer images that we’re now recommending Ubuntu or Debian slim. The extra ~50 MB is a worthwhile trade-off to avoid hard-to-debug musl-libc issues.
Yep. I had an (immutable) JVM ECS service that worked fine for a year, then started failing DNS in just one of several AWS AZs and for just one of various host names--one with something like 24 A records, a few but not an outrageous number. Occasionally forking a process to run 'dig' on the same name made it work for a little while.
AWS support's only advice was "don't use Alpine"; annoyingly, switching the containers to a Debian base cured it, even though this would appear to make absolutely no sense with respect to it failing in just one AZ.
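For anyone wanting to sanity-check whether response size could even be a factor with a record set like that, here is a rough back-of-the-envelope sketch. It assumes the standard DNS wire format with name compression (so each A record costs 16 bytes), the classic 512-byte UDP limit without EDNS, and no authority or additional records. The function names and the hostname in the usage note are made up for illustration and are not from the incident above.

```python
HEADER_BYTES = 12          # fixed DNS header size
CLASSIC_UDP_LIMIT = 512    # traditional limit for DNS over UDP without EDNS
A_RECORD_BYTES = 16        # 2 (compressed name pointer) + 2 type + 2 class
                           # + 4 TTL + 2 rdlength + 4 IPv4 address


def encoded_name_length(hostname: str) -> int:
    """Length of a hostname in DNS wire format (length-prefixed labels + root)."""
    labels = [label for label in hostname.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def estimate_a_response_size(hostname: str, a_records: int) -> int:
    """Estimate the size of a response that carries only A records.

    Assumption: every answer uses a compression pointer for its name and the
    response has no authority or additional sections.
    """
    question = encoded_name_length(hostname) + 4  # qtype + qclass
    return HEADER_BYTES + question + a_records * A_RECORD_BYTES


def fits_classic_udp(hostname: str, a_records: int) -> bool:
    """True if the estimated response fits in a classic 512-byte UDP reply."""
    return estimate_a_response_size(hostname, a_records) <= CLASSIC_UDP_LIMIT
```

For example, `estimate_a_response_size("example.com", 24)` gives 413 bytes, so 24 bare A records on a short name fit under 512 on paper. A longer name, or a server that adds authority or additional records, eats into that margin, which is one way the same name could behave differently depending on which resolver answers.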
Failing in one AZ probably means they changed something with the DNS servers for that region. We had a similar issue recently when we rolled out a new DNS server that returned longer SOA records, which broke Python's MySQL driver in only some regions during the rollout. Debugging nightmare fuel.
Yes, my theory was also some DNS variance in the AZ, but AWS stolidly refused to supply any information to that effect. Adding 'dig +trace' to the equation only deepened the mystery.
Note that some DNS resolvers apparently do not provide truncated UDP results, so that might explain some of the weird DNS issues people see: https://twitter.com/RichFelker/status/994629795551031296