Hacker News new | ask | show | jobs
by 1a1a11a 899 days ago
Disclaimer: this is the co-author.

Sorry for the hyped language. If my guess is correct, the blog was "polished" by ChatGPT, and the intention was to "polish" the English. The paper has more content and should be read more seriously.

We will fix the animation error and update the writing shortly. Thank you for the feedback!

Regarding some of the complaints on SIEVE performance. Here are some more results that I ran for other purposes. https://observablehq.com/@1a1a11a/sieve-miss-ratio-plots

I haven't had time to add other algorithms to the plot, but they are often similar to or worse than TinyLFU (I did not pick TinyLFU to compare deliberately; I happen to have the figures on hand). While the results do not include byte miss ratio, which is important in some use cases, they are better than the results already shown.

We acknowledge that

* SIEVE may not work well on some block I/O workloads (this has been the statement in our writing from the beginning). However, SIEVE works well on large multi-tenanted block I/O workloads (see the Tencent and Alibaba figure in the link). For web workloads, SIEVE shows consistently good results.

* As the author, when I look at this algorithm. My own feeling is that SIEVE may not be robust because it keeps evicting new objects when the hand moves to the head, effectively behaving as MRU. However, our large evaluations showed that modern workloads have certain properties that enable SIEVE to achieve a low miss ratio. We are still investigating this property and how to model it. If someone is interested, feel free to shoot me an email.

* implementing SIEVE with a log-structured design or circular buffer is not trivial, but we had a design called Segcache (https://segcache.com), which runs a constrained version of SIEVE. In simple libraries, linked list implementation should be easy, and one can check the examples on the website.

3 comments

Share a bit more information on how we got here. SIEVE is the more general form of the FIFO-Merge algorithm in Segcache. When I designed the FIFO-merge algorithm, I thought it was worse than LRU, but it turned out to be better.

At first, I thought FIFO-Merge keeping objects in the insertion order (FIFO) helps the eviction because I mostly worked with Twitter data in which popularity decay is common.

But later when I evaluated more traces (including block I/O workloads), it turned out that quickly evicting new objects (quick demotion) is the source of a low miss ratio. That's why we wrote these papers.

Interested readers are welcome to read more of related work 1. FIFO can be better than LRU, the power of lazy promotion and quick demotion https://dl.acm.org/doi/pdf/10.1145/3593856.3595887

2. FIFO queues are all you need for cache eviction https://dl.acm.org/doi/10.1145/3600006.3613147

How does it compare to "random two choices" [1]?

[1] https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2...

It is simpler to implement than LRU - a single "access timestamp" needed to be attached to data. It also shows performance on par or better than LRU when applied to database caching.

Sorry, I may miss something. How would "random two choices" differ from LRU?
I provided a link, you can read more about differences there.

LRU suffers from linear access pattern, where it cannot degrade gracefully. Try to cache N+1 items in N-item LRU when items are accessed linearly or at random with removal.

The "random two choices" algorithm will replace an entry that is less recently accessed from two randomly selected. There is a chance that some of N+1 linealy accessed items will be in cache after a full scan.

Thank you! Yes, this addresses the scan-resistant issue, but I am not sure how different it would be on workloads without scan. But the workloads we used in evaluations have very few scans, and the power of SIEVE comes from quickly evicting newer objects so I guess "random two" would not improve. But I do not have a number right now, I am happy to add this to the comparison list.
Please, do add the "two random choices" to the list. It is a noticeable omission, let me explain why.

Here are some comparisons of 2-random-choices with LRU in context of CPU caches: https://danluu.com/2choices-eviction/

Relevant quote: "LRU does worse than 2-random when the miss rates are high, and better when the miss rates are low."

"The miss rates are high" is the case when evicting newer but unused in nea future objects is preferable, in my opinion. Thus, "two random choices" algorithm is a direct competitor to SIEVE, they work better than LRU in about same conditions.

I had a try on a few traces; the random two-choice algorithm is only better when there is a big scan, and the LRU miss ratio curve shows a cliff. In all other cases, it is worse than LRU eviction.

Implementation can be found https://github.com/1a1a11a/libCacheSim/blob/develop/libCache...

And Segcache is available in Rust as part of Pelikan project https://github.com/pelikan-io/pelikan
We also published it as a crate https://crates.io/crates/segcache