by liquidgecka 1280 days ago
I think what you're describing was actually a problem that I introduced in our bind configuration back when we used DNS for service discovery. It's not exactly what you're describing but I can totally see how it would appear that way externally.

So initially we used 2 bind servers with authoritative zones, serving using round robin. This worked fairly well as the load on each server was high enough to keep the round robin flowing and the connections from ruby would "stick" for 30 seconds anyway. We had a very large pool of front ends so it's not like it mattered. Eventually we had to put a local bind cache on each server, but this introduced another super annoying bug. For some reason, the interaction between the authoritative server and the caching server would end up causing the served record set to be ordered the same. Normally the authoritative server would serve A, B, C for the first query, then B, C, A for the second, etc. With the cache in the middle, for some reason it would reset all queries to A, B, C after the refresh happened. So effectively every 30 seconds it would reset to A, B, C and start round robin again. Since the front ends would connect and stay sticky via connection reuse for a while, this meant that A got like 50% of the load, B got 33%, and C got 17%. I am guessing you latched on to this by seeing that one of the servers was horribly underutilized in comparison and reproduced the math accidentally. =)
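To make the arithmetic concrete, here is a rough Python sketch of how a rotation that restarts at A on every refresh can land near 50/33/17. It is not the actual bind behavior or our real traffic: the function name, the `reset` flag, and the workload (refresh windows that see 1, 2, or 3 lookups equally often) are all assumptions picked for illustration.

```python
from collections import Counter
from itertools import cycle


def first_record_counts(servers, queries_per_window, reset=True):
    """Count how often each server is handed out first.

    queries_per_window lists how many lookups arrive in each refresh window.
    With reset=True the rotation restarts at servers[0] at the start of every
    window (the buggy behavior); with reset=False it keeps rotating across
    windows (plain round robin).
    """
    counts = Counter({s: 0 for s in servers})
    rotation = cycle(servers)
    for n_queries in queries_per_window:
        if reset:
            rotation = cycle(servers)  # refresh puts the order back to A, B, C
        for _ in range(n_queries):
            counts[next(rotation)] += 1
    return counts


def shares(counts):
    """Convert raw counts to whole-number percentages."""
    total = sum(counts.values())
    return {s: round(100 * c / total) for s, c in counts.items()}


servers = ["A", "B", "C"]
# Hypothetical workload: windows with 1, 2, or 3 lookups, equally often.
windows = [1, 2, 3] * 100

skewed = shares(first_record_counts(servers, windows, reset=True))
even = shares(first_record_counts(servers, windows, reset=False))
print("with reset:   ", skewed)
print("without reset:", even)
```

Under that assumed workload A is first in every window, B only in windows with at least two lookups, and C only in windows with three, which is where the 3:2:1 split comes from. A different mix of lookups per window would give different numbers.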

The fix for this was the massively disgusting but super effective "make every single machine a DNS zone secondary". Rather than having a simple record cache we just fetched the whole zone if it changed. This actually made DNS quite a bit faster as every machine had a local copy of the zone. Once that happened distribution returned to normal (20/20/20/20/20), which fixed things for all of our internal services which used DNS for discovery, including snowflake.


This is awesome and is completely retconning a 10-year-old project for me. I was working on social media analytics at Disney and we were exploring ways to measure Twitter conversation about our brands, which led to us attempting to estimate total conversation volume, which is why this technical nuance was relevant to us. It was a wildly experimental time. Thanks for the story!
This thread is amazing.. thank you both for sharing!
Agree, was super excited when I saw reference to the German tank problem as that is immediately what I thought of in the first post.
I dream of the day in the future, possibly in Eden, when I'll have the opportunity to discuss the things I've reverse engineered with the guys and gals who actually engineered them. This conversation is enlightening to witness.

I will have an especially hard time phrasing questions in a respectful manner for the Adobe devs, though ))