by jchandra
63 days ago
I’ve been exploring KV cache optimization for LLM inference. Most methods (Top-K, sliding window) prune tokens. This works on average but fails selectively: a few tokens cause large errors when removed. I tried reframing the problem as approximating the attention function itself: Attn(Q, K, V).

Prototype:
- entropy → identify weak tokens
- OLS → reconstruct their contribution
- SVD → compress them

Early results show lower error than Top-K at low memory budgets, and sometimes lower memory use overall. This is still a small research prototype; I would appreciate feedback or pointers to related work.
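
To make the three steps concrete, here is a minimal NumPy sketch of one way they could fit together. It is not the prototype's code, and every detail is an assumption made for illustration: the entropy rule (mean attention perplexity sets the keep budget), the OLS target (a linear map from the query to the gap between full and pruned attention), the rank, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention, no masking.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def split_by_entropy(Q, K):
    # ASSUMED heuristic: the mean perplexity exp(H) of the attention rows
    # sets how many tokens to keep; tokens are ranked by mean attention mass.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    H = -(A * np.log(A + 1e-12)).sum(axis=1)
    budget = int(np.clip(np.ceil(np.exp(H).mean()), 1, K.shape[0] - 1))
    order = np.argsort(-A.mean(axis=0))
    return order[:budget], order[budget:]  # strong, weak

def features(Q):
    # Queries plus a bias column, used as regressors for the correction.
    return np.hstack([Q, np.ones((Q.shape[0], 1))])

def fit_correction(Q, K, V, strong, rank):
    # OLS: regress the output lost by pruning (full minus pruned) on the queries.
    residual = attention(Q, K, V) - attention(Q, K[strong], V[strong])
    W, *_ = np.linalg.lstsq(features(Q), residual, rcond=None)
    # SVD: keep only the top `rank` components of the fitted map.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return W, U[:, :rank] * s[:rank], Vt[:rank]

def approx_attention(Q, K_strong, V_strong, left, right):
    # Pruned attention plus the low-rank correction for the dropped tokens.
    return attention(Q, K_strong, V_strong) + (features(Q) @ left) @ right

# Synthetic data: n cached tokens, m calibration queries, head dimension d.
rng = np.random.default_rng(0)
n, m, d = 256, 64, 16
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Q = rng.normal(size=(m, d))

strong, weak = split_by_entropy(Q, K)
W, left, right = fit_correction(Q, K, V, strong, rank=4)

full = attention(Q, K, V)
pruned = attention(Q, K[strong], V[strong])
err_pruned = np.linalg.norm(full - pruned)
err_ols = np.linalg.norm(full - (pruned + features(Q) @ W))
err_lowrank = np.linalg.norm(
    full - approx_attention(Q, K[strong], V[strong], left, right)
)
print(f"kept {len(strong)} of {n} tokens")
print(f"pruned: {err_pruned:.4f}  OLS: {err_ols:.4f}  rank-4: {err_lowrank:.4f}")
```

The errors here are measured on the same queries used for fitting, so they are optimistic; a fair comparison against Top-K would need held-out queries and an equal memory budget. The full-rank OLS correction cannot do worse than plain pruning on the calibration set, but the rank-truncated version carries no such guarantee.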