by lumost
6 hours ago
The KV cache is order-dependent: each entry depends on all the tokens that precede it in the context. There are some transformation approaches for reusing the KV cache across inferences, but none are in wide use because of accuracy concerns after the transformation.
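To make the context dependence concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: a toy two-layer, single-head model with random weights, no positional encoding, and no layer norm or MLP. It puts the same "document" tokens after two different prefixes and compares the cached keys for the document positions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption)

# Hypothetical random weights for a two-layer, single-head toy model.
layers = [
    {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("W_q", "W_k", "W_v")}
    for _ in range(2)
]

def causal_attention(x, w):
    """One causal self-attention layer with a residual connection."""
    q, k, v = x @ w["W_q"], x @ w["W_k"], x @ w["W_v"]
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide later tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v, (k, v)

def kv_cache(x):
    """Run all layers and collect the (K, V) pair each layer would cache."""
    cache = []
    for w in layers:
        x, kv = causal_attention(x, w)
        cache.append(kv)
    return cache

doc = rng.normal(size=(4, d))       # embeddings of the shared document
prefix_a = rng.normal(size=(3, d))  # two different preceding contexts
prefix_b = rng.normal(size=(3, d))

cache_a = kv_cache(np.vstack([prefix_a, doc]))
cache_b = kv_cache(np.vstack([prefix_b, doc]))

# Compare cached keys for the document tokens (the last 4 rows).
layer1_same = np.allclose(cache_a[0][0][-4:], cache_b[0][0][-4:])
layer2_same = np.allclose(cache_a[1][0][-4:], cache_b[1][0][-4:])
print("layer 1 keys match:", layer1_same)
print("layer 2 keys match:", layer2_same)
```

In this toy, the first-layer keys match because they are a projection of the raw token embeddings, while the second-layer keys differ because their input has already attended to the prefix. A real model with positional encodings would also see first-layer differences whenever the document's position shifts.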
|
My understanding was that the KV cache stores nothing more than the outputs (the "activations") of the W_k and W_v projections of each attention module for a given input sequence.
So I don't quite understand how this is supposed to work:
> Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill.
Should a publisher precompute the cache for every popular model that is out there?
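A rough sketch of why the cache looks model-specific, assuming the usual layout of one K and one V tensor per layer. The `ModelConfig` fields, the model names, and the `weights_id` fingerprint are all hypothetical, made up for illustration rather than taken from any real model or library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # All fields are hypothetical; weights_id stands in for a checkpoint hash.
    name: str
    n_layers: int
    n_kv_heads: int
    head_dim: int
    weights_id: str

def kv_cache_shape(cfg, n_tokens):
    """Shape of a full cache: (layers, K-and-V, heads, tokens, head_dim)."""
    return (cfg.n_layers, 2, cfg.n_kv_heads, n_tokens, cfg.head_dim)

def kv_cache_bytes(cfg, n_tokens, bytes_per_value=2):
    """Cache size in bytes, assuming 2-byte (fp16/bf16) values by default."""
    total = 1
    for dim in kv_cache_shape(cfg, n_tokens):
        total *= dim
    return total * bytes_per_value

def cache_is_loadable(producer, consumer):
    """A cache is only meaningful for the exact weights that produced it."""
    same_shape = kv_cache_shape(producer, 1) == kv_cache_shape(consumer, 1)
    return same_shape and producer.weights_id == consumer.weights_id

model_a = ModelConfig("model-a", 32, 8, 128, weights_id="a-base")
model_a_ft = ModelConfig("model-a-finetune", 32, 8, 128, weights_id="a-ft")
model_b = ModelConfig("model-b", 40, 8, 128, weights_id="b-base")

print(kv_cache_bytes(model_a, n_tokens=1000))   # 131072000
print(cache_is_loadable(model_a, model_a))      # True
print(cache_is_loadable(model_a, model_a_ft))   # False: same shape, other weights
print(cache_is_loadable(model_a, model_b))      # False: different shape
```

Under these assumptions the answer to the question would be yes: the publisher needs one cache per model, and arguably one per checkpoint, since even a fine-tune with identical tensor shapes has different W_k and W_v and therefore different cached values.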