by Straw 1018 days ago
The validation curves will look identical. These models are far too small to overfit to the training set.

With a large enough model and many epochs, you can certainly get overfitting. But within a single epoch every batch is new to the model, so the training loss is effectively measured on unseen data and the val/train curves look exactly the same. I'd expect that a 7B model will never overfit on 2T tokens, no matter how many epochs you do.
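For a rough sense of the scale in that last claim, here is a back-of-the-envelope sketch. The helper name `tokens_per_parameter` is made up for illustration, and tokens per parameter is only a crude heuristic for how hard a dataset is to memorize, not a proof that overfitting can't happen.

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """Training tokens seen per model parameter in one pass over the data."""
    return num_tokens / num_params


# Figures from the comment: a 7B-parameter model trained on 2T tokens.
ratio = tokens_per_parameter(2e12, 7e9)
print(f"{ratio:.0f} tokens per parameter per epoch")
```

So each epoch shows the model a few hundred tokens for every parameter it has.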