by vishal0123 1212 days ago
The scaling law is for training until convergence. Both PaLM and this model have been undertrained. See the training loss plot in the paper.

hey thanks for your reply.

umm... so does OpenAI. In fact, this is OpenAI's own finding from [1]:

>Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)

>We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)

P.S. Not trolling. genuinely trying to learn.

[1] https://arxiv.org/abs/2001.08361
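
To get a feel for how slowly data grows under the D ∼ C^0.27 relation quoted above, here is a minimal sketch. It assumes a pure power law with that exponent and ignores the constant factor; the function name is made up for illustration and is not from the paper.

```python
# Exponent taken from the quoted passage (D ~ C^0.27).
DATA_EXPONENT = 0.27


def data_growth_factor(compute_multiplier: float, exponent: float = DATA_EXPONENT) -> float:
    """Return how much the data requirement grows when compute is scaled
    by `compute_multiplier`, assuming D is proportional to C**exponent."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    return compute_multiplier ** exponent


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(f"{k}x compute -> {data_growth_factor(k):.2f}x data")
```

Under that assumption, 10x more compute calls for less than 2x more data, which is the "growing very slowly" claim in the quote.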

This is the old scaling laws paper. The scaling laws in it turned out to be wrong and were superseded by the Chinchilla paper from DeepMind: https://arxiv.org/abs/2203.15556
hi again, genuinely trying to learn here. The Chinchilla paper is a COMPETING thesis, right? The OpenAI thesis hasn't been changed or superseded here.
LLaMA made the tradeoff of reducing the parameter budget instead of the training compute budget. This is better for the inference compute budget.

The compute-optimal number of tokens for 7B parameters is around 140B [0], and Meta trained it on a trillion tokens.

[0]: https://arxiv.org/pdf/2203.15556.pdf
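
The arithmetic behind those numbers can be sketched as below. The 20 tokens per parameter ratio is an assumption: it is the commonly cited rule of thumb associated with the Chinchilla paper and it matches the 140B figure for 7B parameters, but the paper's fitted values vary by method. The function names are hypothetical.

```python
# Assumed rule of thumb: roughly 20 training tokens per parameter is compute-optimal.
TOKENS_PER_PARAM = 20.0


def compute_optimal_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return n_params * tokens_per_param


def overtraining_ratio(n_params: float, trained_tokens: float,
                       tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """How many times more tokens were used than the compute-optimal estimate."""
    return trained_tokens / compute_optimal_tokens(n_params, tokens_per_param)


if __name__ == "__main__":
    n = 7e9        # 7B parameters
    trained = 1e12  # one trillion tokens, as stated in the comment
    print(f"optimal tokens: {compute_optimal_tokens(n) / 1e9:.0f}B")
    print(f"trained on {overtraining_ratio(n, trained):.1f}x the optimal amount")
```

So under this rule of thumb the 7B model saw roughly seven times the compute-optimal token count, which costs more to train but gives a smaller model to serve.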