by miven
440 days ago
The key and value vectors are cached; that's kind of the whole point of autoregressive transformer models. The "state" not only survives within the KV cache but, in some sense, grows continuously with each token added, and is reused for each subsequent token.

But that doesn't change the fact that the only inputs to the Q, K and V calculations are the tokens (or, in later layers, information derived from the tokens), and each vector in the cache maps directly to an input token.
So I think you could disable the cache and recompute everything in each round and you'd still get the same result, just a lot slower.
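
To make that concrete, here is a rough sketch of a toy single-head, single-layer causal attention step in plain Python, run once with a KV cache and once recomputing K and V from scratch each round. Everything in it (the size `D`, the random weights `Wq`/`Wk`/`Wv`, the helper names, the `CachedAttention` class) is made up for illustration and not taken from any real model or library; it only checks that the two paths agree.

```python
import math
import random

D = 4  # hypothetical toy embedding size


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    # m is rows x cols, v has length cols
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def attend(q, keys, values):
    # Scaled dot-product scores, softmax, then weighted sum of the values.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(q))]


def last_output_recompute(prefix, Wq, Wk, Wv):
    # No cache: K and V are recomputed for every token in the prefix.
    keys = [matvec(Wk, t) for t in prefix]
    values = [matvec(Wv, t) for t in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values)


class CachedAttention:
    # With cache: one K and one V vector is appended per input token.
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(matvec(self.Wk, token))
        self.values.append(matvec(self.Wv, token))
        return attend(matvec(self.Wq, token), self.keys, self.values)


rng = random.Random(0)
Wq, Wk, Wv = (rand_matrix(D, D, rng) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

cached = CachedAttention(Wq, Wk, Wv)
cached_outputs = [cached.step(t) for t in tokens]
recomputed_outputs = [last_output_recompute(tokens[: i + 1], Wq, Wk, Wv)
                      for i in range(len(tokens))]

# Largest elementwise difference between the two paths.
max_diff = max(abs(a - b)
               for co, ro in zip(cached_outputs, recomputed_outputs)
               for a, b in zip(co, ro))
print(max_diff)
```

The cache ends up with exactly one key and one value vector per input token, and the recompute path does quadratically more work to reach the same outputs. A real multi-layer model should behave the same way in principle, though there small floating-point differences can show up depending on how the kernels batch the work.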