by cgel 668 days ago
Yes, that is mostly the idea. But calling the state of a linear transformer a KV cache is not quite right. A KV cache grows with the sequence length, whereas the linear transformer state just stores V @ K.T, an object whose size is fixed no matter how long the sequence gets.
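
To make the size difference concrete, here is a rough NumPy sketch. It assumes the simplest form of linear attention (no feature map, no normalization), with keys and values stored as columns of K and V so that the state is literally V @ K.T. The dimensions d_k, d_v, T and all variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
d_k, d_v, T = 4, 3, 10
rng = np.random.default_rng(0)

K = rng.standard_normal((d_k, T))  # column t is the key for token t
V = rng.standard_normal((d_v, T))  # column t is the value for token t

kv_cache_sizes = []               # floats a KV cache would hold after each token
state = np.zeros((d_v, d_k))      # linear transformer state, fixed shape
state_shapes = []

for t in range(T):
    k_t, v_t = K[:, t], V[:, t]
    # A KV cache keeps every key and value seen so far, so it grows with t.
    kv_cache_sizes.append((t + 1) * (d_k + d_v))
    # The linear state only accumulates the outer product v_t k_t^T.
    state += np.outer(v_t, k_t)
    state_shapes.append(state.shape)

# After the full sequence, the accumulated state is exactly V @ K.T.
matches = np.allclose(state, V @ K.T)

# Reading out with a query q gives the same result as attending over
# all stored keys and values (unnormalized linear attention).
q = rng.standard_normal(d_k)
out_from_state = state @ q
out_from_cache = V @ (K.T @ q)
```

The cache size climbs linearly with the number of tokens, while the state stays a d_v x d_k matrix throughout, and both give the same readout under these assumptions.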