by joaogui1
1161 days ago
You still have one vector per token; that's what they meant. The vector associated with each token is also ultimately used to predict the next token, which again shows that it makes sense to talk about other tokens even though they're being transformed inside the model.
It's pedagogically unfortunate that the residual stream lives in the same space as the token embeddings, because that obscures how the residual stream acts as a kind of general compressed-information conduit through the model: attention heads read different information from it and write different information to it to enable the eventual prediction task.
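
To make the read/write picture concrete, here is a rough toy sketch in plain Python. It is not any real model's code: the sizes, the random weights, and names like `attention_head` and `stream_after` are all made up for illustration, and layer norm, MLPs, and multiple heads are left out. The stream holds one vector per token, a single causal head reads from it through its own projections, its output is added back in, and the last position's vector is unembedded to score the next token.

```python
import math
import random

random.seed(0)

# Hypothetical toy sizes, chosen only for illustration.
D_MODEL = 8   # width of the residual stream
D_HEAD = 4    # width of the head's private subspace
VOCAB = 5
tokens = [3, 1, 4]

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(v, m):
    """Row vector v (len rows) times matrix m (rows x cols)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def softmax(xs):
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

embed = rand_matrix(VOCAB, D_MODEL, scale=1.0)
W_q, W_k, W_v = (rand_matrix(D_MODEL, D_HEAD) for _ in range(3))  # "read" projections
W_o = rand_matrix(D_HEAD, D_MODEL)                                # "write" projection
unembed = rand_matrix(D_MODEL, VOCAB)

def attention_head(stream):
    """One causal head: returns the vector it would write at each position."""
    q = [matvec(x, W_q) for x in stream]
    k = [matvec(x, W_k) for x in stream]
    v = [matvec(x, W_v) for x in stream]
    writes = []
    for t in range(len(stream)):
        # Position t may only look at positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(D_HEAD)
                  for s in range(t + 1)]
        weights = softmax(scores)
        mixed = [sum(w * v[s][j] for s, w in enumerate(weights))
                 for j in range(D_HEAD)]
        writes.append(matvec(mixed, W_o))
    return writes

# The stream starts out as the token embeddings: one vector per token.
stream = [list(embed[t]) for t in tokens]

# The head's output is added to the stream rather than replacing it.
writes = attention_head(stream)
stream_after = [[x + w for x, w in zip(xs, ws)] for xs, ws in zip(stream, writes)]

# Each position's vector is unembedded; the last one scores the next token.
logits = [matvec(x, unembed) for x in stream_after]
next_token_probs = softmax(logits[-1])
```

Note that the head only ever touches the stream through `W_q`, `W_k`, `W_v` (reading) and `W_o` (writing), so nothing forces what it writes to look like a token embedding, even though the vectors have the same width.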