by madrox
1288 days ago
Awesome to hear from someone on the other side of the API with knowledge of this. This is ringing bells for me. Yeah, the id parameter was how we knew how many servers there were, and we saw more assignments on some servers than others, which neatly mapped to int32 max failing to divide evenly by the number of IDs we saw. I thought I recalled Twitter confirming that was how round robin happened, but I could be totally misremembering. We never got a contact to talk to Twitter about it. FWIW, we did eventually see this fixed. I imagine it was pretty easy to spot one server seeing less load than the others. The offset was actually how we calculated volume, because millisecond collisions become a variant of the German tank problem[1]. A few times when y'all made tweet volumes public, it mapped pretty closely with our estimates. This was around 2011, so your knowledge should be relevant.
[1] https://en.wikipedia.org/wiki/German_tank_problem
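For anyone unfamiliar with the German tank problem mentioned above, here is a minimal sketch of the textbook estimator in Python. It assumes serial numbers start at 1 and are sampled without replacement; the function name and the sample values are made up for illustration, and this is not necessarily the exact calculation used on the tweet IDs.

```python
def estimate_population_max(observed):
    """Textbook German tank estimate: m + m/k - 1.

    m is the largest serial number seen and k is the sample size.
    Assumes serials run 1..N and were sampled without replacement.
    """
    k = len(observed)
    if k == 0:
        raise ValueError("need at least one observation")
    m = max(observed)
    return m + m / k - 1


# Hypothetical sample of four observed serial numbers.
estimate = estimate_population_max([19, 40, 42, 60])
print(estimate)
```

The idea is that the largest value seen underestimates the true maximum, and the average gap between observations (m/k) is a reasonable correction for what lies beyond it.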
So initially we used 2 bind servers with authoritative zones, serving using round robin. This worked fairly well, as the load on each server was high enough to keep the round robin flowing and the connections from Ruby would "stick" for 30 seconds anyway. We had a very large pool of front ends, so it's not like it mattered. Eventually we had to put a local bind cache on each server, but this introduced another super annoying bug. For some reason, the interaction between the authoritative server and the caching server would end up causing the served record set to always be ordered the same way. Normally the authoritative server would serve A, B, C for the first query, then B, C, A for the second, etc. With the cache in the middle, for some reason it would reset all queries to A, B, C after the refresh happened. So effectively every 30 seconds it would reset to A, B, C and start round robin again. Since the front ends would connect and stay sticky via connection reuse for a while, this meant that A got like 50% of the load, B got 33%, and C got 17%. I am guessing you latched on to this by seeing that one of the servers was horribly underutilized in comparison and reproduced the math accidentally. =)
The fix for this was the massively disgusting, but super effective, "make every single machine a DNS zone secondary". Rather than having a simple record cache, we just fetched the whole zone if it changed. This actually made DNS quite a bit faster, as every machine had a local copy of the zone. Once that happened, distribution returned to normal (20/20/20/20/20) for all of our internal services which used DNS for discovery, including snowflake.
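The 50/33/17 split described above falls out of a very simple model, sketched here in Python. The assumptions are mine, not from the comment: each client takes the first record in the answer, the rotation restarts at A, B, C on every cache refresh, and a refresh window is equally likely to see 1, 2, or 3 lookups. The function name and inputs are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def first_record_shares(servers, queries_per_window):
    """Share of lookups landing on each server when rotation restarts every window.

    queries_per_window lists how many lookups happen in each refresh window.
    """
    hits = Counter()
    for n in queries_per_window:
        for i in range(n):
            # The i-th lookup after a reset sees the list rotated i times,
            # so a client that takes the first record gets servers[i % len(servers)].
            hits[servers[i % len(servers)]] += 1
    total = sum(hits.values())
    return {s: Fraction(hits[s], total) for s in servers}


# One window each with 1, 2, and 3 lookups (assumed equally likely).
shares = first_record_shares(["A", "B", "C"], [1, 2, 3])
for server, share in shares.items():
    print(server, f"{float(share):.0%}")
```

Under those assumptions A is hit in every window, B only when a window has at least two lookups, and C only when it has three, which gives 3/6, 2/6, and 1/6. Real traffic would not be this tidy, but the direction of the skew is the same whenever the rotation keeps getting reset.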