Hacker News new | ask | show | jobs
by jedberg 495 days ago
It would be interesting to see this on reddit's workload. The entire system was designed around the cache getting a 95%+ hit rate, because basically anything on front page of the top 1000 subreddits will get the overwhelming majority of traffic, so the cache is mostly filled with that.

In other words, this solves the problem of "one hit wonders" getting out of the cache quickly, but that basically already happened with the reddit workload.

The exception to that was Google, which would scrape old pages, and which is why we shunted them to their own infrastructure and didn't cache their requests. Maybe with this algo, we wouldn't have had to do that.

3 comments

Wouldn’t one hit wonders still be an issue? They might get evicted relatively fast anyway but assuming an LRU each will still take a cache entry until they go through the entire thing and finally get evicted.

Although if that’s your concern you can probably just add a smaller admission cache in front of the main cache, possibly with a promotion memory.

That's kind of the idea of Caffeine, it has admission buffers, and it adapts automatically between LRU and LFU. The original algorithm is called Windiw TinyLFU (design https://github.com/ben-manes/caffeine/wiki/Design), see it in action e.g. here: https://github.com/ben-manes/caffeine/wiki/Efficiency
I know that, but I’m not replying to a comment about caffeine, rather the opposite.
I think the idea is that the cache is so large that hot data won't be forced out by one-hit wonders. In this 2017 talk [1], the speaker says that Twitter's SLA depends on having a 99.9% hit rate. It is very common to have extremely over provisioned remote caching tier for popular sites. That makes eviction not as important and reducing their operational costs comes by purging expired data more proactively. Hence, memcached switched from away from relying on its LRU to discard expired entries to using a sweeper. Caffeine's approach, a timing wheel, was considered but dormando felt it was too much of an internal change for memcached and the sweeper could serve multiple purposes.

[1] https://www.youtube.com/watch?v=kxMKnx__uso

what are/were Reddit's top two or three cached structures / things?

guessing post bodies and link previews feels too easy.

comment threads? post listings?

was there a lot of nesting?

it sounds like you're describing a whole post--use message, comments, and all--for presentation to a browser or crawler.

(sorry, saw the handle and have so many questions :D)

Doesn’t reddit use Cassandra, Solr, and Kafka which uses Caffeine?