by peaslock 1277 days ago
Not necessarily: https://arxiv.org/abs/2206.14486

Also, even under the "Chinchilla laws", a larger model still gains performance; you just need a lot more data (if it is just as noisy) to reach the same level of convergence. With the same amount of data, a larger model will already have partially converged to a better model.


I've actually seen this paper before, but I don't think it's helpful. If the entire GitHub is 100B tokens and you prune it down properly, then fine, you can get equal performance with fewer tokens. However, if you want improved performance, you still need more data, not just a larger model size, and that's hard to obtain. I don't think it's a lost cause, or that we will be stuck with current performance by any means, though: there are other ways to go.

> if you want improved performance, you still need more data

Not true. See figure 2: https://arxiv.org/pdf/2203.15556.pdf#page=5

The loss decreases with greater model size at the same compute budget (i.e. stopping sooner in terms of training data). Also, some rehearsal/multi-epoch training improves the forgetting rate (thereby improving performance substantially), which the Chinchilla authors did not take into account because they train for less than one epoch.

https://arxiv.org/abs/2205.12393

No. It shows the opposite. All model sizes converged to a similar loss as the compute increased towards maximum. But larger models had larger loss for a given compute budget.

Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train"
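
For what it's worth, the valley can be reproduced from the parametric loss fit given in the same paper, L(N, D) = E + A/N^alpha + B/D^beta. A minimal sketch, assuming the commonly quoted fitted constants (E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28) and the rule of thumb FLOPs ~= 6 * N * D. The function names, the sweep range, and the 1e21 FLOP budget are made up for illustration:

```python
# Constants assumed from the parametric fit quoted for the Chinchilla paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_curve(flops, model_sizes):
    """Loss for each model size when tokens are set by flops ~= 6 * N * D."""
    return [(n, predicted_loss(n, flops / (6 * n))) for n in model_sizes]


# Hypothetical sweep: 1e8 to 1e11 parameters at a fixed 1e21 FLOP budget.
model_sizes = [10 ** (8 + 0.25 * i) for i in range(13)]
curve = isoflop_curve(1e21, model_sizes)
best_size, best_loss = min(curve, key=lambda point: point[1])

for n, loss in curve:
    marker = "  <- lowest" if n == best_size else ""
    print(f"N = {n:.2e}  loss = {loss:.3f}{marker}")
```

Under these assumptions the lowest loss lands at an intermediate model size, with both the smallest and the largest model in the sweep doing worse, which is the "valley" the quote describes.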

Yes, but the losses in Figure 3 increase because the larger models see less data in order to keep the FLOP budget constant, not because of overfitting. Large models do not overfit very much, so when you keep the dataset size constant, a larger model will still reach a lower loss than a smaller one.
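
To make the distinction concrete, here is a rough sketch that compares the two setups using the parametric fit L(N, D) = E + A/N^alpha + B/D^beta from the Chinchilla paper. The constants are the commonly quoted fitted values and are an assumption here, as are the FLOPs ~= 6 * N * D approximation, the 1B/10B model sizes, and the 1e21 FLOP budget. The fit has no overfitting term at all, so this only illustrates the bookkeeping (fixed data versus fixed compute), not whether large models overfit:

```python
# Constants assumed from the parametric fit quoted for the Chinchilla paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def predicted_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


SMALL, LARGE = 1e9, 1e10  # hypothetical 1B and 10B parameter models

# Case 1: same dataset for both models (100B tokens, the figure used upthread).
tokens = 100e9
same_data_small = predicted_loss(SMALL, tokens)
same_data_large = predicted_loss(LARGE, tokens)

# Case 2: same FLOP budget, so the larger model sees 10x fewer tokens.
flops = 1e21
same_flops_small = predicted_loss(SMALL, flops / (6 * SMALL))
same_flops_large = predicted_loss(LARGE, flops / (6 * LARGE))

print(f"same data:  1B = {same_data_small:.3f}, 10B = {same_data_large:.3f}")
print(f"same FLOPs: 1B = {same_flops_small:.3f}, 10B = {same_flops_large:.3f}")
```

With these numbers the 10B model comes out ahead when both models see the same 100B tokens, and behind when both are held to the same FLOP budget, because in that case it is trained on a tenth of the data.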