Hacker News new | ask | show | jobs
by wren6991 6 days ago
Prefix caching is already widely deployed by all providers, right? llama.cpp does it. vLLM does it. I'm sure everyone hosting LLMs for a living does it. This paper seems to focus entirely on prefixes (i.e. the prefilled content is rooted at 0). This is... nothing.

The referenced CacheBlend paper (https://arxiv.org/pdf/2405.16444) which tries to stitch together multiple independent prefills looks more interesting and is new to me. The problem it's trying to solve is:

* KV projections for a given token at a given layer are a function of the residual at that layer,

* which is a function of the attention contribution of the previous layer,

* which is a (nonlinear) function of all earlier tokens' KV values at the previous layer.

This is what stops you from just pasting KV blocks together. Intuitively it might feel like you could do the equivalent of an MPEG deblocking filter to fix up the edges, but there's no guarantee the tokens that need fixing up are at the beginning of the KV block, so they have to be sneaky about it.

Unfortunately while that paper is quite verbose it's lacking in detail in the most important part: how they perform the approximate KV recomputation. It looks like the rough idea is that they fully recompute the KV for the first layer, and use the deviation between the recompute and the original cached KV as a heuristic for how important it is to recompute the full KV values (i.e. all remaining layers) for that token. They use that to derive a mask for the tokens which most strongly attend to the earlier context, then do a sparse update of those tokens.

What's still unclear is how this actually ends up being a performance win, given that the sparse update for each token still requires the exact KV for all the prior tokens in order to actually arrive at the correct value. It just kind of recurses the problem instead of fixing it. Maybe they just use the precomputed KV for the other tokens as input, and live with the approximation?

I think this is already somewhat pragmatically solved: just don't pull huge documents into context. Give the LLM tools to search them and retrieve the fragments that are actually relevant.

1 comments

Yeah, I'm really not sure what the point of this paper is. Every non-toy environment does prefix caching.
Yes, but presumably the authors are suggesting broader application than just caching a system prompt.

The paper's approach should work well if (a) you can calculate KV(A || B) as a function of KV(A) and KV(B) independently, (b) you can identify which documents A1, A2, A3, ... are used commonly enough to be worth caching, and (c) it is cheaper to buy and sell KV(A) on a market than to compute KV(A) when it is needed. Given the size of KV(A) I am not sure that (c) will become true even if people solve the open research problem represented by (a) and accept the state-of-the-art trade-offs known for (b).

> Yes, but presumably the authors are suggesting broader application than just caching a system prompt

The authors of the OP paper "Can I Buy Your KV Cache?" explicitly disregard anything involving KV not rooted at 0:

>> We deliberately study the simplest, safe form: a document treated as a shared prefix, with continuations appended after it

So no, I really think it's just prefix caching. That's actually far from the weirdest thing about that paper: they go on to "prove" that decoding from cached prefill gets the same result as prefilling and decoding on the same content, which... yes. That is how computation works.

Also, the thing they describe already exists: you pay your provider for their cache implementation as part of your token ingress costs. What is that if not paying for cached KV?