by brazzledazzle 902 days ago
Did you actually query the DNS from the container to verify DNS was returning an incorrect record in response to the query? I ask because I've seen similar behavior and it turned out the service was only doing DNS lookups at startup and then cached the record indefinitely (or until restarted), regardless of the TTL on the record. Unfortunately some software and libraries don't respond well to even occasional DNS changes.
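For what it's worth, here is a rough sketch of the kind of check I mean, something you could run inside the container. It assumes Python is available there, and `probe_dns` and the `svc.internal` hostname are made-up names for illustration. It resolves the name repeatedly through the system resolver and counts failures and the addresses returned, so you can see whether lookups intermittently fail or return a stale record.

```python
import socket
import time
from collections import Counter


def probe_dns(hostname, attempts=50, delay=0.1, resolve=socket.getaddrinfo):
    """Resolve `hostname` repeatedly.

    Returns (Counter mapping address -> times seen, number of failed lookups).
    `resolve` is injectable so the loop can be exercised without a network.
    """
    seen = Counter()
    failures = 0
    for _attempt in range(attempts):
        try:
            # getaddrinfo uses the system resolver, the same path most apps take.
            infos = resolve(hostname, None, proto=socket.IPPROTO_TCP)
            for info in infos:
                sockaddr = info[4]  # (address, port, ...) tuple
                seen[sockaddr[0]] += 1
        except socket.gaierror:
            failures += 1
        time.sleep(delay)
    return seen, failures


# Hypothetical usage from inside the container:
#   seen, failures = probe_dns("svc.internal", attempts=100, delay=0.5)
#   print(seen, failures)
```

If the failure count is zero and the address is always right, the problem is more likely caching in the app than the DNS server itself.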
3 comments

About 15 years ago I worked with a vendor that didn’t realize their web service was ignoring TTLs. I think this was the Java 5 days. We had changed an IP on our end and they kept trying to connect to the wrong one for a webhook. It took weeks of sending tcpdump logs back and forth to convince them. They finally restarted their app.
In our case the problem is kind of the opposite, as far as we could tell.

The TTL is 2 seconds, but the app and the service always deploy together and always run on the same node as one another. So when we deploy, both the app and the service go to a new node, on which both will run.

But because the TTL is so low, every new connection (traffic is pretty low for this particular app, unlike some other apps in our cluster) is pretty certain to do another DNS lookup. And about 10% of the time we were getting a connection error that boiled down to DNS.

So to confirm it was the problem, we changed it to not do a DNS lookup for now, since at the moment the app and the service are always on the same node.

But soon we are changing things around and they will no longer be guaranteed to run on the same node, nor will they deploy together.

So I still need to come up with something that lets us do DNS lookups but not have the problem we've been having.
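One direction I'm considering, purely as a sketch and not something we run today: retry a failed lookup a few times, and if it still fails, fall back to the last address that resolved successfully. `FallbackResolver` and its parameters are names I made up for illustration, and it assumes the client can be handed an IP instead of a hostname.

```python
import socket
import time


class FallbackResolver:
    """Retry DNS lookups and fall back to the last known good addresses."""

    def __init__(self, retries=3, backoff=0.05, resolve=socket.getaddrinfo):
        self._retries = retries
        self._backoff = backoff
        self._resolve = resolve  # injectable so this can be tested offline
        self._last_good = {}     # hostname -> list of IP strings

    def lookup(self, hostname):
        for attempt in range(self._retries):
            try:
                infos = self._resolve(hostname, None, proto=socket.IPPROTO_TCP)
                addrs = [info[4][0] for info in infos]
                if addrs:
                    self._last_good[hostname] = addrs
                    return addrs
            except socket.gaierror:
                pass  # transient failure: retry below
            if attempt < self._retries - 1:
                # Linear backoff between attempts.
                time.sleep(self._backoff * (attempt + 1))
        if hostname in self._last_good:
            # Every attempt failed: reuse the previous answer (may be stale).
            return self._last_good[hostname]
        raise socket.gaierror(f"could not resolve {hostname} and no cached address")
```

The obvious catch is that the fallback address can be stale right after a deploy moves the service to a new node, so this only papers over lookup failures rather than wrong answers.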

Ugh, what a terrible bug, especially in the cloud age.