


	by zone411 1277 days ago
	It doesn't work like this. A 1T model without architectural changes would not perform substantially better unless it has been trained on a lot more code. The original Codex was trained on 100B tokens, so you could possibly get some gains by increasing the model size but only up to a point. See the Chinchilla paper for reference.
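	As a rough illustration of the Chinchilla reference above, the sketch below applies the commonly quoted rule of thumb of about 20 training tokens per parameter (Chinchilla itself was 70B parameters trained on 1.4T tokens). The constant and the function names are illustrative assumptions, not values from this thread, and the ratio is an approximation rather than an exact law.

```python
# Rule of thumb read off the Chinchilla result (70B parameters, 1.4T tokens):
# roughly 20 training tokens per parameter. An approximation, not an exact law.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate token count a compute-optimal run would use for n_params."""
    return TOKENS_PER_PARAM * n_params


def compute_optimal_params(n_tokens: float) -> float:
    """Approximate model size that a dataset of n_tokens would support."""
    return n_tokens / TOKENS_PER_PARAM


if __name__ == "__main__":
    # Tokens a 1T-parameter model would want under this rule of thumb.
    print(f"{compute_optimal_tokens(1e12):.1e} tokens")
    # Model size that 100B tokens would support under the same rule.
    print(f"{compute_optimal_params(100e9):.1e} parameters")
```

	Under this approximation a 1T-parameter model would want on the order of 20T tokens, while 100B tokens corresponds to a model of roughly 5B parameters.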

2 comments

peaslock 1277 days ago

Not necessarily: https://arxiv.org/abs/2206.14486

Also, even with "Chinchilla laws", you still gain performance with a larger model. You just need a lot more data (if just as noisy) to reach the same level of convergence, but with the same amount of data a larger model will already have partially converged to a superior model.


zone411 1277 days ago

I've actually seen this paper before, but I don't think it's helpful. If the entire GitHub is 100B tokens and you prune it down properly, then fine, you can get equal performance with fewer tokens. However, if you want improved performance, you still need more data, not just a larger model size, and that's hard to obtain. I don't think it's a lost cause, or that we will be stuck with current performance by any means, though. There are other ways to go.


peaslock 1277 days ago

> if you want improved performance, you still need more data

Not true. See figure 2: https://arxiv.org/pdf/2203.15556.pdf#page=5

The loss decreases with greater model size at the same compute budget (i.e. stopping earlier in terms of training data). Also, some rehearsal/multi-epoch training improves the forgetting rate (thereby improving performance substantially), which the Chinchilla authors did not take into account because they train for less than one epoch.

https://arxiv.org/abs/2205.12393


zone411 1277 days ago

No. It shows the opposite. All model sizes converged to a similar loss as the compute increased towards maximum. But larger models had larger loss for a given compute budget.

Their text about Figure 3 confirms what I'm saying: "We find a clear valley in loss, meaning that for a given FLOP budget there is an optimal model to train"


peaslock 1274 days ago

Yes, but the losses in Figure 3 increase because the larger models see less data in order to keep the FLOP budget constant, not because of overfitting. Large models do not overfit very much, so the loss of a larger model will still be better than that of a smaller model when you keep the dataset size constant.
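The two readings in this exchange can be compared with a minimal sketch of the parametric loss fit from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The constants below are the fitted values as reported there, the FLOPs ~ 6·N·D relation is the usual approximation, and the function names, model sizes, and budgets are hypothetical choices for illustration. The fit covers single-pass training only, so it says nothing about multi-epoch effects.

```python
# Parametric fit from the Chinchilla paper: L(N, D) = E + A / N**alpha + B / D**beta
# Constants as reported for that fit; treat them as approximate.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

# Hypothetical model sizes (parameters) used for the comparison.
SIZES = [1e9, 1e10, 1e11, 1e12]


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Tokens affordable under the common approximation FLOPs ~= 6 * N * D."""
    return flops / (6 * n_params)


def fixed_data(n_tokens: float) -> list[float]:
    """Loss per model size when every model sees the same number of tokens."""
    return [loss(n, n_tokens) for n in SIZES]


def fixed_compute(flops: float) -> list[float]:
    """Loss per model size when every model gets the same FLOP budget."""
    return [loss(n, tokens_for_budget(flops, n)) for n in SIZES]


if __name__ == "__main__":
    for n, l_data, l_flops in zip(SIZES, fixed_data(100e9), fixed_compute(1e22)):
        print(f"N={n:.0e}  fixed 100B tokens: {l_data:.3f}  fixed 1e22 FLOPs: {l_flops:.3f}")
```

Under this fit both statements hold at once: with the token count held fixed, the predicted loss falls monotonically as the model grows, while with the FLOP budget held fixed it falls and then rises again, which is the valley quoted from the paper.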


karmasimida 1277 days ago

Original Codex is Python only.


zone411 1277 days ago

True. I think they're counting duplicated code though. I don't see any mention of de-duplication in their paper.
