Hacker News
by guy98238710 1160 days ago
Why would you train the model on a shorter context than you can provide? Why not give it all the context you have? Sure, the model has to learn to handle short contexts, but that happens naturally at the beginning of each document.

Anyway, this still involves only left-side masking. Why mask future tokens when a sliding window can do the same thing (without wasting a single token of context)?
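
To make the sliding-window idea concrete, here is a rough sketch of how I picture it. The function name `sliding_windows` and the `max_context` parameter are made up for illustration, and real training code would batch this on tensors rather than loop over lists. Each target token gets as much left context as exists, capped at the window size, so short contexts only show up at the start of the document.

```python
def sliding_windows(tokens, max_context):
    """Yield (context, target) pairs for next-token prediction.

    Hypothetical helper: the context is the up-to-max_context tokens
    immediately preceding the target, so no future token is ever visible.
    """
    for i in range(1, len(tokens)):
        start = max(0, i - max_context)  # clamp the window at the document start
        yield tokens[start:i], tokens[i]


doc = ["a", "b", "c", "d", "e"]
pairs = list(sliding_windows(doc, max_context=3))

for context, target in pairs:
    print(context, "->", target)
```

The first pairs have contexts of length 1 and 2 (the start of the document), and every later pair uses the full window of 3.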