by remontoire
1160 days ago
I think it's because you want the model to be able to predict the next token from just 1 token, from the whole context window, or from any prefix length in between. So you end up with n different losses for each text snippet (where n is the size of the context window). If I'm wrong, can someone correct me here? It would be useful to know.
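To make the "n losses per snippet" idea concrete, here is a rough sketch in plain Python. The helper names `prefix_targets` and `causal_mask` are made up for illustration and don't come from any library, and a real model works on token IDs rather than strings. It just lists the (prefix, next token) pairs one snippet gives you, plus the lower-triangular mask that would let each position see only itself and earlier tokens.

```python
def prefix_targets(tokens):
    # One (prefix, next-token) pair per position.
    # A snippet of n + 1 tokens yields n pairs, i.e. n separate losses.
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def causal_mask(n):
    # mask[i][j] is True when position i may look at position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat", "down"]  # toy snippet, hypothetical data
pairs = prefix_targets(tokens)
mask = causal_mask(len(tokens) - 1)

for prefix, target in pairs:
    print(prefix, "->", target)

for row in mask:
    print([int(allowed) for allowed in row])
```

Row i of the mask corresponds to the i-th pair: the prefix for that pair is exactly the set of positions marked as visible, which is how every prefix length can be scored from the same snippet.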
Anyways, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?