| HN Mirror

	by remontoire 1160 days ago
	I think it's because you want the model to be able to predict the next token from any context length, from a single token up to the whole context window. So you end up with n different losses for each text snippet, one per position, where n is the size of the context window. If I'm wrong, can someone correct me here? It would be useful to know.
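
	To make the "n different losses" idea concrete, here is a rough sketch of the (context, target) pairs that a causally masked model would be scored on for one snippet. The function name `prefix_targets` and the word-level "tokens" are made up for illustration; a real setup would use token ids and compute all of these in a single masked forward pass.

```python
def prefix_targets(tokens):
    """Return the (context, target) pairs implied by causal masking.

    Position i may only see tokens[:i + 1] and is asked to predict
    tokens[i + 1], so a snippet of n + 1 tokens yields n pairs,
    and therefore n loss terms.
    """
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Hypothetical toy snippet; words stand in for tokens.
snippet = ["the", "cat", "sat", "on", "the", "mat"]
pairs = prefix_targets(snippet)

for context, target in pairs:
    print(context, "->", target)
```

	With 6 tokens this gives 5 pairs, with contexts of length 1 through 5, which is the "1 token or the whole context window (and any size in between)" picture.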

1 comment

guy98238710 1160 days ago

Why would you train the model on a shorter context than you can provide? Why not provide all the context you have? Sure, the model has to learn to handle short contexts, but that occurs naturally at the beginning of the document.

Anyways, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?
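
For reference, a minimal sketch of the sliding-window scheme described here, assuming one training example per target token and a fixed window size. The helper `sliding_window_examples` and the single-character "tokens" are hypothetical, only meant to show what the examples would look like.

```python
def sliding_window_examples(tokens, window):
    """Build one (context, target) pair per target position.

    Each target is preceded by up to `window` tokens of context.
    No future tokens appear in the context, so no future mask is needed.
    """
    examples = []
    for t in range(1, len(tokens)):
        start = max(0, t - window)  # clip at the start of the document
        examples.append((tokens[start:t], tokens[t]))
    return examples


# Hypothetical toy document; characters stand in for tokens.
doc = list("abcdefgh")
examples = sliding_window_examples(doc, window=4)

for context, target in examples:
    print("".join(context), "->", target)
```

Short contexts show up only for the first few targets of the document; every later target gets the full window. Under the assumption of one forward pass per window, this sketch needs as many passes as there are targets, each producing a single loss.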
