by sailingparrot
236 days ago
But you are missing causal attention from your analysis.
The output is not the only thing that is preserved; there is also the KV-cache. At token 1, the model goes through, say, 28 transformer blocks, and in each of those blocks we save two projections of the hidden state (the keys and values) in a cache. At token 2, on top of seeing the new token, the model can also, in each of those 28 blocks, attend to the projections saved from token 1. At token 3, it can see the saved states from tokens 2 and 1, and so on. However, I still agree that this is not a perfect information-passing mechanism because of how these models are trained (something like the Feedback Transformer would be better), but information is very much being passed from earlier tokens to later ones.
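To make that concrete, here is a rough sketch of a single causal attention head with a KV-cache, in plain Python. Everything in it is made up for illustration: the class name `AttentionHeadWithCache`, the `step` method, the random weights, and the tiny dimension are assumptions, not any real model's code. A real model would keep one such cache per block (28 of them in the example above) and usually several heads per block.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class AttentionHeadWithCache:
    """One causal attention head of one block (hypothetical toy, random weights)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.dim = dim
        self.w_q, self.w_k, self.w_v = rand_matrix(), rand_matrix(), rand_matrix()
        self.k_cache = []  # one key vector per token seen so far
        self.v_cache = []  # one value vector per token seen so far

    def step(self, hidden):
        """Process one new token's hidden state, attending over all cached tokens."""
        q = matvec(self.w_q, hidden)
        # The two projections saved per block: the key and the value.
        self.k_cache.append(matvec(self.w_k, hidden))
        self.v_cache.append(matvec(self.w_v, hidden))
        scale = math.sqrt(self.dim)
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in self.k_cache]
        weights = softmax(scores)
        # Weighted sum of cached values, including the current token's own value.
        out = [
            sum(w * v[i] for w, v in zip(weights, self.v_cache))
            for i in range(self.dim)
        ]
        return out, weights


head = AttentionHeadWithCache(dim=4)
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
for t, hidden in enumerate(tokens, start=1):
    out, weights = head.step(hidden)
    print(f"token {t}: attends over {len(weights)} position(s)")
```

Each call to `step` attends over one more position than the previous one, because the keys and values of earlier tokens stay in the cache. Changing the first token changes the output computed at the second, which is the information passing described above.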