by dgellow 7 days ago
Just curious, do you have links to read more about transformations or other techniques for KV cache reuse?

All major model providers offer prefix caching, which is what this is.
No, reusing segments of the KV cache for different purposes in an order-independent manner is an active research area.
Any keyword or paper I can search for?
AsyncReasoning[1] uses a trick of that sort to give agents concurrent cache views.

You basically have two agents look at the same cache under different views. Say agent_0 gets [a_1, a_0] and agent_1 gets [a_0, a_1]. They also write to this cache concurrently while decoding. To resolve the positional-embedding inconsistencies, they rotate the query projections separately for each block (a_0 and a_1).
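
To make the rotation part concrete, here is a minimal sketch assuming rotary position embeddings (RoPE). It is not the paper's code, and the helper names (rope, dot) and the toy vectors are made up for illustration. Since the attention score under RoPE depends only on the query position minus the key position, shifting the query by the block's offset gives the same score as re-encoding the cached key at its new position.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of an even-length vector by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key projections (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

cached_pos = 2   # position the key was rotated with when it was cached
view_pos = 7     # position the key should occupy in this agent's view
query_pos = 10   # position of the token being decoded

# Reference: re-encode the key at its position in the view.
reference = dot(rope(q, query_pos), rope(k, view_pos))

# Trick: keep the cached key as-is and shift the query for this block instead.
shift = view_pos - cached_pos
reused = dot(rope(q, query_pos - shift), rope(k, cached_pos))

print(abs(reference - reused) < 1e-9)
```

This only shows that a single attention score can be matched. The keys and values cached in deeper layers were still computed under whatever context the block was originally prefilled with.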

The computations you get that way do not exactly match the setup where you would naively prefill on every step, but are close enough.

Same trick could be used for the setup discussed here, I guess: prefill the document cache separately (p), prepend the system prompt (s) and get a cache view [s, p] from which you can then decode.
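
A rough sketch of the bookkeeping for such a view, under my own assumptions: Block and view_shifts are hypothetical names, and each block is assumed to have been prefilled on its own starting at position 0. The view lays the blocks out back to back and records how far each block's positions move relative to where it was cached, which is the per-block offset you would rotate the queries by.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    cached_start: int  # position of the block's first token when it was prefilled
    length: int        # number of cached tokens in the block

def view_shifts(view):
    """Return per-block position shifts for a view, plus the next free position."""
    shifts = {}
    cursor = 0
    for block in view:
        shifts[block.name] = cursor - block.cached_start
        cursor += block.length
    return shifts, cursor

s = Block("s", cached_start=0, length=4)  # system prompt
p = Block("p", cached_start=0, length=6)  # document, prefilled separately

shifts, next_pos = view_shifts([s, p])
print(shifts, next_pos)  # {'s': 0, 'p': 4} 10
```

Decoding would then start at position next_pos, with queries shifted by shifts[name] when attending to each block.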

1. https://arxiv.org/abs/2512.10931

But this would work only for the first layer, or am I missing something?