Hacker News
by
cgel
668 days ago
Yes, that is mostly the idea. But calling the state of a linear transformer a KV cache is not quite right. A KV cache grows with the sequence length, whereas the linear transformer state just stores V @ K.T, an object whose size is fixed no matter how many tokens have been processed.
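To make the size difference concrete, here is a minimal NumPy sketch. It assumes unnormalized linear attention with no feature map, and takes K and V to hold one token per column; the function name `run` and the dimensions `d_k`, `d_v` are made up for illustration and do not come from any particular implementation.

```python
import numpy as np

def run(seq_len, d_k=4, d_v=3, seed=0):
    """Process seq_len random tokens and track both kinds of memory."""
    rng = np.random.default_rng(seed)
    kv_cache = []                 # KV cache: one (k, v) pair appended per token
    state = np.zeros((d_v, d_k))  # linear transformer state: running sum of v k^T

    for _ in range(seq_len):
        k = rng.standard_normal(d_k)
        v = rng.standard_normal(d_v)
        kv_cache.append((k, v))   # grows with the sequence
        state += np.outer(v, k)   # shape stays (d_v, d_k)

    # Stack the cached keys and values as columns to form K and V.
    K = np.stack([k for k, _ in kv_cache], axis=1)  # shape (d_k, seq_len)
    V = np.stack([v for _, v in kv_cache], axis=1)  # shape (d_v, seq_len)
    return len(kv_cache), state, V @ K.T

for n in (10, 1000):
    cache_len, state, full = run(n)
    print(n, cache_len, state.shape, np.allclose(state, full))
```

The cache length tracks the number of tokens, while the state keeps the same shape and matches V @ K.T computed from the full cache.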