

by xg15 9 days ago

Isn't it also, most fundamentally, dependent on the model weights?

My understanding was that the KV cache stores nothing other than the "activations" produced by the W_k and W_v matrices of an attention module, i.e. the keys and values, for a given input sequence.
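To make that concrete, here is a minimal sketch of a single attention layer's cache, assuming the usual formulation K = X·W_k and V = X·W_v. The helper names, the tiny dimensions, and the random "weights" are all made up for illustration. In a real model the input X to each layer also depends on the embeddings and every earlier layer, so the dependence on the weights is even stronger than shown here.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, seed):
    """Deterministic pseudo-random matrix standing in for trained weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def kv_cache(hidden, w_k, w_v):
    """Keys and values for one attention layer: K = X W_k, V = X W_v."""
    return matmul(hidden, w_k), matmul(hidden, w_v)

d_model, d_head, seq_len = 4, 2, 3

# The same "document" (hidden states) is fed to both hypothetical models.
hidden = random_matrix(seq_len, d_model, seed=0)

# Two hypothetical models that differ only in their projection weights.
cache_a = kv_cache(hidden, random_matrix(d_model, d_head, 1), random_matrix(d_model, d_head, 2))
cache_b = kv_cache(hidden, random_matrix(d_model, d_head, 3), random_matrix(d_model, d_head, 4))

# Recomputing with model A's weights reproduces model A's cache exactly.
cache_a_again = kv_cache(hidden, random_matrix(d_model, d_head, 1), random_matrix(d_model, d_head, 2))

print(cache_a == cache_a_again)  # same weights, same cache
print(cache_a == cache_b)        # different weights, different cache
```

Same input, different weights, different cache: a cache precomputed with one model's W_k and W_v is just numbers that mean nothing to another model.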

So I don't quite understand how this is supposed to work:

> Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill.

Should a publisher precompute the cache for every popular model that is out there?


xg15 9 days ago

...not to mention, which KV cache? Every attention module has its own, and how many attention modules there are, what inputs they get, how many internal features and attention heads they have, etc., all depend on the architecture of the specific model.
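A rough sketch of the shape problem, assuming a common layout of one K and one V tensor per layer, each of size (KV heads × tokens × head dimension). The two configurations below are hypothetical and not taken from any specific model, and the token count is held equal even though different tokenizers would normally split the same document differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical architecture parameters that determine the cache layout."""
    num_layers: int
    num_kv_heads: int
    head_dim: int

def kv_cache_shape(cfg, seq_len):
    """(layers, K-and-V, KV heads, tokens, head dim) for one sequence."""
    return (cfg.num_layers, 2, cfg.num_kv_heads, seq_len, cfg.head_dim)

def kv_cache_elements(cfg, seq_len):
    """Total number of stored scalars in the cache."""
    total = 1
    for dim in kv_cache_shape(cfg, seq_len):
        total *= dim
    return total

# Two made-up models with different depth, head count, and head size.
model_a = AttentionConfig(num_layers=12, num_kv_heads=12, head_dim=64)
model_b = AttentionConfig(num_layers=32, num_kv_heads=8, head_dim=128)

shape_a = kv_cache_shape(model_a, seq_len=1000)
shape_b = kv_cache_shape(model_b, seq_len=1000)

print(shape_a, kv_cache_elements(model_a, 1000))
print(shape_b, kv_cache_elements(model_b, 1000))
```

The two caches don't even have the same shape, so before asking whether the values transfer you'd already need one precomputed artifact per architecture.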
