by madrox 1288 days ago
Awesome to hear from someone on the other side of the API with knowledge of this. This is ringing bells for me. Yeah, the id parameter was how we knew how many servers there were, and we saw more assignments on some servers than others, in a way that neatly mapped to an int32 max failing to divide evenly by the number of IDs we saw. I thought I recalled Twitter confirming that was how round robin happened, but I could be totally misremembering. We never got a contact to talk to Twitter about it. FWIW, we did eventually see this fixed. I imagine it was pretty easy to spot one server seeing less load than others.

The offset was actually how we calculated volume, because millisecond collisions become a variant of the German tank problem [1]. A few times when y'all made tweet volumes public, they mapped pretty closely to our estimates.
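For readers who haven't seen it, here is a minimal sketch of the textbook German tank estimator applied to this kind of data. It assumes the per-millisecond offsets start at 0, are handed out contiguously, and that the sampled tweets are a uniform draw from that millisecond. The function name `estimate_total` and the sample values are made up for illustration; the thread does not say which estimator was actually used.

```python
def estimate_total(observed_offsets):
    """Estimate how many IDs were issued, given a sample of 0-based offsets.

    Uses the classic frequentist estimator N ~ m + m/k - 1, where m is the
    largest observed serial number (1-based) and k is the sample size.
    """
    k = len(observed_offsets)
    if k == 0:
        raise ValueError("need at least one observation")
    m = max(observed_offsets) + 1  # convert the 0-based offset to a 1-based serial
    return m + m / k - 1


# Hypothetical sample: three tweets seen in the same millisecond bucket.
estimate = estimate_total([2, 5, 11])
print(estimate)
```

Summing estimates like this over many buckets is one plausible way to get a total volume figure, though that aggregation step is an assumption too.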

This was around 2011, so your knowledge should be relevant.

1: https://en.wikipedia.org/wiki/German_tank_problem


I think what you're describing was actually a problem that I introduced in our bind configuration back when we used DNS for service discovery. It's not exactly what you're describing, but I can totally see how it would appear that way externally.

So initially we used 2 bind servers with authoritative zones, serving using round robin. This worked fairly well, as the load on each server was high enough to keep the round robin flowing, and the connections from ruby would "stick" for 30 seconds anyway. We had a very large pool of front ends, so it's not like it mattered. Eventually we had to put a local bind cache on each server, but this introduced another super annoying bug. For some reason, the interaction between the authoritative server and the caching server would end up causing the served record set to be ordered the same. Normally the authoritative server would serve A, B, C for the first query, then B, C, A for the second, etc. With the cache in the middle, for some reason it would reset all queries to A, B, C after the refresh happened. So effectively every 30 seconds it would reset to A, B, C and start round robin again. Since the front ends would connect and stay sticky via connection reuse for a while, this meant that A got like 50% of the load, B got 33%, and C got 17%. I am guessing you latched on to this by seeing that one of the servers was horribly underutilized in comparison and reproduced the math accidentally. =)
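A toy model of that skew, for anyone who wants to see where numbers like 50/33/17 can come from. The assumption here (mine, not stated above) is that each cache refresh window sees 1, 2, or 3 lookups with equal frequency, and that the rotation restarts at A on every refresh. `first_record_shares` is a hypothetical helper, not anything from the real setup.

```python
from collections import Counter
from fractions import Fraction


def first_record_shares(servers, lookups_per_window):
    """Share of connections per server when rotation restarts each window.

    The i-th lookup inside a window gets servers[i % len(servers)] as its
    first record, and the client is assumed to connect to the first record.
    """
    picks = Counter()
    for lookups in lookups_per_window:
        for i in range(lookups):
            picks[servers[i % len(servers)]] += 1
    total = sum(picks.values())
    return {s: Fraction(picks[s], total) for s in servers}


# Three windows with 1, 2, and 3 lookups respectively (assumed mix).
shares = first_record_shares(["A", "B", "C"], [1, 2, 3])
for server, share in shares.items():
    print(server, f"{float(share):.0%}")
```

With that mix A is first in every window, B only in windows with at least two lookups, and C only in windows with three, which gives 3/6, 2/6, and 1/6. Other lookup mixes would give different splits, so treat this as one way the reported numbers could arise rather than a reconstruction.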

The fix for this was the massively disgusting, but super effective, "make every single machine a DNS zone secondary". Rather than having a simple record cache, we just fetched the whole zone if it changed. This actually made DNS quite a bit faster, as every machine had a local copy of the zone. Once that happened, distribution returned to normal (20/20/20/20/20). This fixed it for all of our internal services which used DNS for discovery, including snowflake.

This is awesome and completely retconning a 10-year-old project for me. I was working on social media analytics at Disney and we were exploring ways to measure Twitter conversation about our brands, which led to us attempting to estimate total conversation volume, which is why this technical nuance was relevant to us. It was a wildly experimental time. Thanks for the story!
This thread is amazing. Thank you both for sharing!
Agree, was super excited when I saw reference to the German tank problem as that is immediately what I thought of in the first post.
I dream of the day in the future, possibly in Eden, when I'll have the opportunity to discuss the things I've reverse engineered with the guys and gals who actually engineered them. This conversation is enlightening to witness.

I will have an especially hard time phrasing questions in a respectful manner for the Adobe devs, though ))

Also, totally unrelated Twitter story (because I am nostalgic lately and you reminded me of it)

Early in my time there we had a major, unexpected load spike in the middle of the day. People were tweeting like crazy, breaking records for volume, etc. The system was groaning under strain but it mostly held. Turns out Michael Jackson had died. We sustained 452 (or maybe 457, it was roughly 450-something) tweets a second. This quickly became 1 "MJ" worth of tweets. We informally used this to define load the entire time I was there. Within a few months we got to a point where we were never BELOW 1 MJ, within a year I think we had peaked at double digit MJs and sustained several even in the lowest periods. Before I left we had hit 1 MJ in photo tweets, etc.

Around the time I left we did something like 300 MJs, and I was only there 3 years.
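For scale, a quick conversion from MJs to tweets per second. The 450 figure is the rough value recalled above (452 or 457 were the candidates), and `mj_to_tweets_per_second` is just an illustrative helper.

```python
MJ_TWEETS_PER_SECOND = 450  # rough value; the recollection above is 452 or 457


def mj_to_tweets_per_second(mj, tweets_per_mj=MJ_TWEETS_PER_SECOND):
    """Convert a load expressed in MJs into tweets per second."""
    return mj * tweets_per_mj


peak = mj_to_tweets_per_second(300)
print(peak)  # 135000
```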

I remember those days before snowflake, when Blaine was desperately trying to keep the lights on. It's why no one had time to talk to us (marketers and related disciplines) back then. Even Facebook said "all our focus is on acquiring users. You have no choice but to meet us on our terms because we own the eyeballs." Everything was growing so fast. Hard to believe that was only a decade ago.
Musk claimed a record of 20,000 tweets per second recently [1]. How does that square with what you’re saying? 300 times 450 is about 135,000. Am I missing something?

[1] https://twitter.com/elonmusk/status/1595505413113323520?s=20...

That tweet does not claim 20,000 is a record.

The event also did not reach 20,000, it "almost" hit it.

You’re right. I misinterpreted ‘record usage’. I found a 2013 blog post [1] claiming a 143 thousand tweets per second peak.

[1] https://blog.twitter.com/engineering/en_us/a/2013/new-tweets...

That's interesting; I worked on a different site that also got record traffic when MJ died. I wonder if that happened to every site that had a chat or news feature.

Kind of obvious now, but I bet we could detect major world news just by sampling traffic size of chat sites.