| HN Mirror



	by a1k0n 1057 days ago
	Yes, in fact it has to. If a transformer has zero context to attend to and you try to predict the first token, you are effectively multiplying a zero vector by the attention head, which makes all tokens equally likely in the final softmax (unless the lm_head has a bias, but at least in GPT it does not). So the <|beginning of text|> token, with no context before it, learns to predict the first-token-in-a-document distribution. That's not quite the same as predicting nothing at all.
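
A toy sketch of the zero-vector argument, not GPT's actual code: it assumes the hidden state reaching the output projection is exactly zero and ignores layer norm, residual connections, and position embeddings. The names `lm_head`, `weight`, and `bias` here are hypothetical stand-ins, and the sizes are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lm_head(hidden, weight, bias=None):
    """Project a hidden vector to vocabulary logits: weight @ hidden (+ bias)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    if bias is not None:
        logits = [l + b for l, b in zip(logits, bias)]
    return logits


random.seed(0)
vocab_size, d_model = 5, 4  # illustrative sizes only

# Random projection weights: their values do not matter when the input is zero.
weight = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
zero_hidden = [0.0] * d_model

# No bias: every logit is 0, so the softmax is uniform over the vocabulary.
probs_no_bias = softmax(lm_head(zero_hidden, weight))

# With a bias: the logits equal the bias, so the distribution is no longer uniform.
bias = [random.gauss(0, 1) for _ in range(vocab_size)]
probs_with_bias = softmax(lm_head(zero_hidden, weight, bias))

print(probs_no_bias)
print(probs_with_bias)
```

Without a bias the first list is flat at 1/vocab_size regardless of the weights, which is why the model needs an actual start token whose embedding can carry the first-token distribution.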