Hacker News
by trishume 1252 days ago
I'd store quote tweets as a reference to the quoted tweet, so each one would basically cost as much as loading 2 tweets instead of one. That increases the delivery rate by the fraction of tweets that are quote tweets.
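
A rough sketch of that arithmetic, assuming every quote tweet triggers exactly one extra tweet load. The function name and the numbers in the usage note are made-up placeholders, not figures from the post:

```python
def effective_delivery_rate(base_rate: float, quote_fraction: float) -> float:
    """Tweet loads per second when each quote tweet also loads the tweet it references.

    base_rate and quote_fraction are hypothetical inputs for illustration.
    """
    if not 0.0 <= quote_fraction <= 1.0:
        raise ValueError("quote_fraction must be between 0 and 1")
    # Every delivered tweet costs one load; a quote tweet costs one extra load.
    return base_rate * (1.0 + quote_fraction)
```

For example, with a placeholder base rate of 1000 loads per second and a quote fraction of 0.25, `effective_delivery_rate(1000.0, 0.25)` gives 1250.0.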

Hashtags are a search feature and basically need the same posting lists as search, but if you only support hashtags the posting lists are smaller. I already have an estimate saying full search probably wouldn't fit. But I think hashtag-only search might fit, mainly because my impression is that people doing hashtag searches are a small fraction of traffic nowadays, so the main cost is disk. I'm not sure, though.

I did run the post by 5 ex-Twitter engineers and none of them said any of my estimates were super wrong; they mainly brought up additional features and things I didn't discuss (which I edited into the post before publishing). It's still possible that they just didn't divulge, or didn't know, some number that I estimated very wrong.

3 comments

I think the difficult part would be tagging and indexing the relationship between a single tweet and all of its component hashtags (on which you would then likely want precomputed metrics, to avoid needing to count index entries, etc.). That is where it would really start to inflate.

Another poster dug into some implementation details that I'm not going to go into. I think you could shoehorn it into an extremely large server alongside the rest of your project, but then processing overhead and capacity management for the indexes themselves start to take up a more substantial share of processing power. Consider that for each tweet you need to break out which hashtags are in it, create records, and update indexes, and often there are several hashtags in a given tweet.
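
To make the per-tweet work concrete, here is a minimal in-memory sketch of hashtag posting lists. It is only an illustration under simple assumptions (hashtags match `#\w+`, tags are stored verbatim, everything fits in a dict); `HashtagIndex` and its methods are hypothetical names, not anything Twitter is known to use:

```python
import re
from collections import defaultdict

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> set[str]:
    """Return the distinct hashtags in a tweet, without the leading '#'."""
    return set(HASHTAG_RE.findall(text))


class HashtagIndex:
    """Toy inverted index: hashtag -> list of tweet ids (a posting list)."""

    def __init__(self) -> None:
        self.postings: dict[str, list[int]] = defaultdict(list)

    def add_tweet(self, tweet_id: int, text: str) -> int:
        """Index one tweet and return how many posting lists were updated."""
        tags = extract_hashtags(text)
        for tag in tags:
            # One index write per distinct hashtag in the tweet.
            self.postings[tag].append(tweet_id)
        return len(tags)

    def lookup(self, tag: str) -> list[int]:
        """Return the tweet ids recorded for a hashtag (exact match)."""
        return self.postings.get(tag.lstrip("#"), [])
```

The return value of `add_tweet` is the write amplification being described: a tweet with three distinct hashtags costs three index updates on top of storing the tweet itself.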

When I last ran analytics on the firehose data (ca. 2015/16) I saw something like 20% of all tweets had 3 or more hashtags. I only remember this fact because I built a demo around doing that kind of analytics. That may have changed over time, obviously, but without that kind of information we don't even have a good guesstimate of what storage and index management look like there. I'd be curious whether the former Twitter engineers you polled worked on the data storage side of things. Coming at it from the other end, I've met more than a few application engineers who genuinely have no clue how much work a DBA (or equivalent) does to get things stored and indexed well and responsively.

Twitter has full-text search, not just hashtags.

Also, the big data storage isn't text, it's images and videos.

You’re missing metadata in your size estimates.
I don’t think that hashtags are a search-only feature. In the posts themselves, the hashtags are clickable to view other tweets. I don’t think that qualifies as a search.
It does strike me as a feature you'd typically serve out of some sort of search index, since if you had to build search, you'd essentially get indexing of hashtags "for free".
You are probably right and I am wrong. I just looked at a tweet, and clicking the hashtag takes you to the search page with that hashtag typed in. It's probably implemented similarly behind the scenes, though a hashtag search most likely does an exact match instead of the fuzzy search used for regular words and phrases.
It does case-insensitive matching too (#hashTag === #hashtag === #HashTag).
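
A small sketch of what exact but case-insensitive hashtag matching could look like, assuming tags are normalized before comparison. The helper names are hypothetical, and using Python's `str.casefold` is my assumption, not a description of Twitter's implementation:

```python
def normalize_hashtag(tag: str) -> str:
    """Strip the leading '#' and fold case so variants compare equal."""
    return tag.lstrip("#").casefold()


def same_hashtag(a: str, b: str) -> bool:
    """Exact match after normalization: no stemming, no fuzzy matching."""
    return normalize_hashtag(a) == normalize_hashtag(b)
```

Under this scheme `#hashTag`, `#hashtag` and `#HashTag` all compare equal, while `#hashtag` and `#hashtags` do not, which is the difference from a fuzzy text search.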