Hacker News new | ask | show | jobs
by okl 2684 days ago
> If you have a tight loop, LRU is going to be perfect as long as the loop fits in cache, but it's going to cause a miss every time if the loop doesn't fit. A random eviction policy degrades gracefully as the loop gets too big.

That's the key take-away from my experience with optimizing control loops for microcontrollers with small caches. As soon as your loop is too bit, your code suddenly runs at a hundredth of the speed.

2 comments

I am not sure I understand what you mean by the size of the "loop".

Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches, you will see the performance hit you mention (given that accessing the associated LL node and reordering the list will most likely result in two cache misses).

Try libclc [1] for a compact segmented cache of n 64B containers that exactly fit a cache line, with each container having its own eviction policy. The overhead of LRU per container (of 7 data lines) is 9.14 bits/entry and the same cache line access will give you both the data lines and the LRU order.

[1]: https://github.com/alphazero/libclc

> Conventional LRU uses linked-lists and when the LL no longer fits in the L1/L2 caches

He refers to the LRU-like policy of the I/D cache itself.

Got it. Thanks for the clarification.
It goes beyond microcontrollers to every step up the chain. You want to optimize you system to hit the cache at every level as much as possible. The processor and its cache. The database and its cache. Web servers and their cache. Because going outside of the cache at any level usually incurs a performance hit orders of magnitude worse.