by a1k0n
1057 days ago
Yes, in fact it has to. If a transformer has zero context to attend to and you try to predict the first token, the attention output is effectively a zero vector, so every logit comes out the same and all tokens are equally likely in the final softmax (unless the lm_head has a bias, but at least in GPT it does not). So the <|beginning of text|> token, with no context before it, learns to predict the first-token-in-a-document distribution. That's not quite the same as predicting nothing at all.
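To make the zero-vector argument concrete, here is a minimal toy sketch in plain Python. It is not GPT's actual code: the sizes, the random weights, and the helper names `softmax` and `lm_head` are all made up for illustration, and it assumes the final hidden state really is all zeros. Under those assumptions, a bias-free output projection gives a uniform distribution, while adding a bias breaks the tie.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_head(hidden, weights, bias=None):
    """Toy output projection: one row of weights per vocabulary entry.

    Hypothetical helper for illustration, not a real library API.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    if bias is not None:
        logits = [logit + b for logit, b in zip(logits, bias)]
    return logits


random.seed(0)
vocab_size, d_model = 5, 4  # toy sizes, chosen arbitrarily

# Random projection weights and an optional random bias.
weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
bias = [random.gauss(0, 1) for _ in range(vocab_size)]

# Assumption: with nothing to attend to, the hidden state is all zeros.
zero_hidden = [0.0] * d_model

# No bias: every logit is 0, so the softmax is uniform.
probs_no_bias = softmax(lm_head(zero_hidden, weights))

# With a bias: the logits equal the bias, so the distribution is no longer uniform.
probs_with_bias = softmax(lm_head(zero_hidden, weights, bias))

print(probs_no_bias)
print(probs_with_bias)
```

The point of the sketch is only that a zero input carries no information through a bias-free linear layer, whatever the weights are. A learned embedding for a start-of-text token gives the model a nonzero input to work from instead.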