Hacker News new | ask | show | jobs
by zone411 1277 days ago
No. It shows the opposite. All model sizes converged to a similar loss as the compute increased towards maximum. But larger models had larger loss for a given compute budget.

Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train"

1 comments

Yes, but the losses in Figure 3 increase because the larger models see fewer data to keep the FLOP budget constant, not because of overfitting. Large models do not overfit very much, so the loss of a larger model will still be better compared to a smaller model when you keep dataset size constant.