	by lumost 47 days ago
	The KV cache is order-dependent and also depends on the tokens that precede the cached span. There are some transformation approaches for reusing the KV cache across inferences, but none are in wide use because of accuracy concerns after the transformation.

4 comments

Eridrus 47 days ago

The paper has a section on "Reusing precomputed KV across queries" which talks about how other papers have tried to address this problem, but yeah, this paper adds nothing on its own besides a catchy title.

xg15 47 days ago

Isn't it also, most fundamentally, dependent on the model weights?

My understanding was that what the KV cache stores is nothing other than the "activations" of the W_k and W_v matrices of an attention module for a given input sequence.

So I don't quite understand how this is supposed to work:

> Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill.

Should a publisher precompute the cache for every popular model that is out there?
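
To make the weight dependence concrete, here is a minimal sketch in plain Python. It assumes the usual formulation where one attention layer's cache is K = X W_k and V = X W_v; the helper names and all numbers are made up for illustration and do not come from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kv_cache(hidden, w_k, w_v):
    """Return (K, V) for one attention layer: K = X W_k, V = X W_v."""
    return matmul(hidden, w_k), matmul(hidden, w_v)


# Toy input: 3 tokens with hidden size 2 (made-up numbers).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical "models" that differ only in their key projection.
w_k_a = [[1.0, 2.0], [3.0, 4.0]]
w_k_b = [[0.5, 0.0], [0.0, 0.5]]
w_v = [[1.0, 0.0], [0.0, 1.0]]

k_a, v_a = kv_cache(hidden, w_k_a, w_v)
k_b, v_b = kv_cache(hidden, w_k_b, w_v)

print(k_a)  # keys under model A
print(k_b)  # keys under model B: same tokens, different cache
```

The same input sequence yields different keys under the two projections, so a cache computed for one set of weights cannot simply be loaded into another.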

xg15 47 days ago

...not to mention, which KV cache? Every attention module has its own, and how many attention modules there are, what inputs they get, how many internal features and attention heads they have, etc., all depend on the architecture of the specific model.
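
A rough sketch of how the cache layout follows from the architecture. The `ModelShape` fields and the two example configurations are hypothetical, chosen only to show that the shape (and size) of a precomputed cache changes with layer count, KV-head count, and head dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16/bf16 storage


def kv_cache_shape(model, seq_len):
    """(K and V, layers, KV heads, tokens, head_dim)."""
    return (2, model.num_layers, model.num_kv_heads, seq_len, model.head_dim)


def kv_cache_bytes(model, seq_len):
    """Total bytes needed to store the cache for `seq_len` tokens."""
    total = model.bytes_per_value
    for dim in kv_cache_shape(model, seq_len):
        total *= dim
    return total


# Two hypothetical architectures (not real model configs).
model_a = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
model_b = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)

print(kv_cache_shape(model_a, 1000), kv_cache_bytes(model_a, 1000))
print(kv_cache_shape(model_b, 1000), kv_cache_bytes(model_b, 1000))
```

Even when two models share a head dimension, the tensors are not interchangeable: the shapes differ, and each layer's entries come from that layer's own weights and inputs.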

TZubiri 47 days ago

Absolute slop paper. Replace document with text and you'll get it.

"People are asking the same questions and an answer is generated every time, what if we could like cache the questions and their answers..."

Sounds like someone was using chatgpt to understand how chatgpt works and then asked it to generate a paper based on his proposal to improve it.

amelius 47 days ago

At least it wasn't a patent.

dgellow 47 days ago

Just curious, do you have links to read more about transformations or other techniques for KV cache reuse?

evrydayhustling 47 days ago

All major model providers offer prefix caching, which is this.
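
For reference, a toy sketch of what prefix caching does. The `PrefixCache` class and the handle string are hypothetical; real serving stacks typically hash fixed-size token blocks rather than whole prefixes, so treat this as an illustration of the idea only.

```python
class PrefixCache:
    """Toy prefix cache: maps an exact token prefix to an opaque cache handle."""

    def __init__(self):
        self._store = {}

    def put(self, tokens, handle):
        self._store[tuple(tokens)] = handle

    def longest_prefix(self, tokens):
        """Return (matched_length, handle) for the longest cached prefix, or (0, None)."""
        for end in range(len(tokens), 0, -1):
            handle = self._store.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None


system = [1, 2, 3]   # made-up token ids for a system prompt
doc = [7, 8, 9]      # made-up token ids for a document
question = [42]

cache = PrefixCache()
cache.put(system + doc, "kv#1")

# Same prefix: the 6 cached tokens can skip prefill.
hit = cache.longest_prefix(system + doc + question)

# Same document behind a different first token: no reuse at all.
miss = cache.longest_prefix([5] + doc + question)

print(hit, miss)
```

The second lookup misses because the document no longer sits behind an identical prefix, which is the case a shared, precomputed document cache would have to handle.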

lumost 47 days ago

No, reusing segments of the KV cache for different purposes in an order-independent manner is an active research area.

dgellow 47 days ago

Any keyword or paper I can search for?

dvmazur 47 days ago

AsyncReasoning [1] does a trick of that sort to give agents concurrent cache views.

You basically have two agents look at the same cache under different views. Say agent_0 gets [a_1, a_0] and agent_1 gets [a_0, a_1]. They also write to this cache concurrently while decoding. To solve positional embedding inconsistencies they rotate the query projections for each block (a_0 and a_1) separately.

The computations you get that way do not exactly match the setup where you would naively prefill on every step, but are close enough.

Same trick could be used for the setup discussed here, I guess: prefill the document cache separately (p), prepend the system prompt (s) and get a cache view [s, p] from which you can then decode.
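
A minimal sketch of the positional part of that idea, assuming the model uses rotary position embeddings (RoPE). This is not the paper's code: `rope`, the vectors, and the positions are illustrative. It shows that a key cached at one position can be re-rotated to a shifted position, and that shifting the query the other way gives the same attention score.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector at position `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


k = [0.3, -1.2, 0.7, 0.5]   # made-up unrotated key of a document token
q = [1.0, 0.4, -0.6, 0.9]   # made-up unrotated query
s_len = 5                   # length of the system prompt s to prepend

cached = rope(k, 3)             # key prefilled at position 3 inside p
shifted = rope(cached, s_len)   # re-rotate the cached key by the offset
direct = rope(k, 3 + s_len)     # what prefill of [s, p] would have produced

# Either move the key forward or move the query back: same relative offset.
score_shift_key = dot(rope(q, 10), shifted)
score_shift_query = dot(rope(q, 10 - s_len), cached)

print(shifted, direct)
print(score_shift_key, score_shift_query)
```

This only repairs the position encoding. It does not change the content of the cached keys and values, which were still computed without s in context.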

1. https://arxiv.org/abs/2512.10931

kolinko 47 days ago

But this would work only for the first layer, or am I missing something?
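
A toy illustration of the layer question, under strong simplifying assumptions: single-head causal attention, no positional encoding, a residual connection, and the same hypothetical identity weights reused in both layers. It compares the keys of a document computed alone against the keys of the same document computed behind a prefix.

```python
import math


def matvec(v, w):
    """Row vector times matrix."""
    return [sum(x * w[i][j] for i, x in enumerate(v)) for j in range(len(w[0]))]


def causal_attention(xs, w_q, w_k, w_v):
    """Return (outputs, keys, values) for single-head causal self-attention."""
    qs = [matvec(x, w_q) for x in xs]
    ks = [matvec(x, w_k) for x in xs]
    vs = [matvec(x, w_v) for x in xs]
    outs = []
    for t, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) for j in range(t + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        outs.append([sum(w * vs[j][d] for j, w in enumerate(weights)) / z
                     for d in range(len(vs[0]))])
    return outs, ks, vs


def two_layer_keys(xs, w_q, w_k, w_v):
    """Run two residual attention layers and return the keys of each layer."""
    a1, k1, _ = causal_attention(xs, w_q, w_k, w_v)
    h = [[x + a for x, a in zip(xv, av)] for xv, av in zip(xs, a1)]
    _, k2, _ = causal_attention(h, w_q, w_k, w_v)
    return k1, k2


eye = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical weights
doc = [[1.0, 0.0], [0.0, 1.0]]   # made-up document embeddings
prefix = [[2.0, -1.0]]           # made-up system-prompt embedding

k1_alone, k2_alone = two_layer_keys(doc, eye, eye, eye)
k1_ctx, k2_ctx = two_layer_keys(prefix + doc, eye, eye, eye)

print(k1_alone == k1_ctx[1:])  # layer 1 keys of the document tokens
print(k2_alone == k2_ctx[1:])  # layer 2 keys of the document tokens
```

In this toy setup the first-layer keys of the document match in both runs, while the second-layer keys do not, because the second layer's inputs already mix in the prefix. That would be consistent with the "close enough, not exact" caveat above.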
