by stephenroller
1017 days ago
One noteworthy thing is that no one is posting validation curves, only training curves. With infinite compute, all of these models will happily drive training loss to near zero as they overfit the dataset -- there are no regularizers in any modern LLMs. Validation curves would be considerably more convincing. The counterargument is that none of these models were really trained for multiple epochs: it's hard to overfit data you've only seen once. But to get to 70T tokens, you'd inevitably have to start using many epochs.
With a large enough model and many epochs you can certainly get overfitting, but for a single epoch the val and train curves look exactly the same, and I'd expect that a 7B model will never overfit on 2T tokens no matter how many epochs you do.
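
To put rough numbers on the epoch question, here is a minimal sketch. It assumes the 70T-token budget from the parent comment is run against a corpus of 2T unique tokens (the figure from the reply); pairing those two numbers is my assumption, not something either comment states. `epochs_needed` and `generalization_gap` are hypothetical helper names, and the loss values are made up purely to show the shape of the check.

```python
def epochs_needed(target_tokens: float, unique_tokens: float) -> float:
    """Passes over the dataset required to consume target_tokens in total."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return target_tokens / unique_tokens


def generalization_gap(train_losses, val_losses):
    """Validation loss minus training loss at each checkpoint.

    A gap that keeps widening while training loss falls is the usual
    sign of overfitting; a gap near zero means the curves coincide.
    """
    if len(train_losses) != len(val_losses):
        raise ValueError("need one validation loss per training loss")
    return [v - t for t, v in zip(train_losses, val_losses)]


if __name__ == "__main__":
    # Assumed pairing: 70T training tokens over a 2T-token corpus.
    print(epochs_needed(70e12, 2e12))  # 35.0

    # Illustrative, made-up losses at three checkpoints.
    train = [2.5, 2.0, 1.5]
    val = [2.5, 2.25, 2.0]
    print(generalization_gap(train, val))  # [0.0, 0.25, 0.5]
```

Under that assumption you would need about 35 passes over the data, which is the regime where a validation curve would actually tell you something the training curve does not.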