by evrydayhustling 7 days ago
All major model providers offer prefix caching, which is this.

No, reusing segments of the KV cache for different purposes in an order-independent manner is an active research area.
Any keyword or paper I can search for?
AsyncReasoning[1] does a trick of that sort to give agents concurrent cache views.

You basically have two agents look at the same cache under different views. Say agent_0 gets [a_1, a_0] and agent_1 gets [a_0, a_1]. They also write to this cache concurrently while decoding. To solve positional embedding inconsistencies they rotate the query projections for each block (a_0 and a_1) separately.

The computations you get that way do not exactly match the setup where you would naively prefill on every step, but are close enough.
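To make the query-rotation part concrete, here is a minimal sketch. It assumes rotary position embeddings (RoPE), which is my reading of "rotate the query projections", not something I checked against the paper's code. The helper `rope` and all the numbers are made up for illustration. It only checks the positional part of one attention score: keys cached at one set of positions can be scored as if they sat at shifted positions by rotating the query instead. It says nothing about deeper layers, where the cached keys and values themselves depend on what the block attended to during prefill.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary embedding to x (last dim even) at position(s) pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)           # one frequency per pair
    ang = np.asarray(pos, dtype=float)[..., None] * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)              # un-rotated query projection
k_block = rng.normal(size=(3, d))   # un-rotated key projections of one block

cached_pos = np.arange(3)           # block was prefilled at positions 0..2
shift = 5                           # in this view the block starts at position 5
q_pos = 10                          # position of the decoding token

# Reference: re-rotate the keys to where the view says they are.
reference = rope(k_block, cached_pos + shift) @ rope(q, q_pos)

# Trick: leave the cached keys alone and move the query back by the same shift.
k_cached = rope(k_block, cached_pos)
trick = k_cached @ rope(q, q_pos - shift)

scores_match = np.allclose(reference, trick)
```

This works because a RoPE score depends only on the difference between query and key positions, so each block can get its own query shift while the cache stays untouched.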

Same trick could be used for the setup discussed here, I guess: prefill the document cache separately (p), prepend the system prompt (s) and get a cache view [s, p] from which you can then decode.
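A rough sketch of the bookkeeping such a cache view would need, assuming rotary embeddings and that every block was prefilled on its own starting at position 0. `Block` and `query_shifts` are hypothetical names, not anything from the paper or a real library. For each block, the shift is how far back the query's position is moved when attending to that block.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    length: int            # number of cached tokens in the block
    cached_start: int = 0  # position the block started at when it was prefilled

def query_shifts(view):
    """Map each block name to (start in this view) - (start at prefill time)."""
    shifts = {}
    start = 0
    for block in view:
        shifts[block.name] = start - block.cached_start
        start += block.length
    return shifts

# System prompt s prepended to a separately prefilled document p: view [s, p].
s = Block("s", length=4)
p = Block("p", length=100)
doc_view = query_shifts([s, p])

# Two agents looking at the same two blocks in opposite orders.
a_0 = Block("a_0", length=3)
a_1 = Block("a_1", length=2)
agent_0_view = query_shifts([a_1, a_0])
agent_1_view = query_shifts([a_0, a_1])
```

Here `doc_view` comes out as `{"s": 0, "p": 4}`: the document cache is reused as-is and only queries attending to it are shifted by the length of the system prompt.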

1. https://arxiv.org/abs/2512.10931

But this would work only for the first layer, or am I missing something?