| HN Mirror



	by sandGorgon 1212 days ago
	>*65B model's performance is broadly comparable to PALM-540B. Not a small feat, but also could indicate the benefits of good model-vs-token size ratios [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PALM-540B and Chinchilla-70B is smaller fraction of books and academic training data.*

	What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the opposite?

1 comments

vishal0123 1212 days ago

The scaling law is for training until convergence. Both PaLM and this model have been undertrained; see the training loss plot in the paper.


sandGorgon 1212 days ago

Hey, thanks for your reply.

Umm... so does OpenAI. In fact, this is OpenAI's own finding from [1]:

>Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)

>We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)

P.S. Not trolling, genuinely trying to learn.

[1] https://arxiv.org/abs/2001.08361
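
To make the D ∼ C^0.27 relation in the quote concrete, here is a rough sketch of how much more data it implies for a given increase in compute. Only the exponent 0.27 comes from the quoted passage; the function name and the compute factors are illustrative assumptions, and the proportionality constant is ignored, so only ratios are meaningful.

```python
# Exponent taken from the quoted passage (D ~ C^0.27).
DATA_EXPONENT = 0.27


def data_growth_factor(compute_factor: float, exponent: float = DATA_EXPONENT) -> float:
    """Factor by which the data requirement grows when compute grows by
    `compute_factor`, under a pure power law D ~ C^exponent.

    Hypothetical helper for illustration; it is not from the paper.
    """
    return compute_factor ** exponent


# Illustrative compute multipliers (assumed, not from the thread).
for factor in (10, 100, 1000):
    print(f"{factor:>5}x compute -> {data_growth_factor(factor):.2f}x data")
```

Under this relation, a 10x increase in compute calls for less than 2x the data, which is what "growing very slowly" means in the quote.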


cubefox 1211 days ago

This is the old scaling laws paper. The scaling laws in it turned out to be wrong and were superseded by DeepMind's Chinchilla paper: https://arxiv.org/abs/2203.15556


sandGorgon 1210 days ago

Hi again, genuinely trying to learn here. The Chinchilla paper is a COMPETING thesis, right? The OpenAI thesis hasn't been changed or superseded here.


vishal0123 1210 days ago

LLaMA made the tradeoff of reducing the parameter budget rather than the training compute budget. This is better for the inference compute budget.

The compute-optimal number of tokens for 7B parameters is around 140B [0], and Meta trained it on a trillion tokens.

[0]: https://arxiv.org/pdf/2203.15556.pdf
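
The arithmetic behind the 140B figure can be sketched as follows. This assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter, which is an approximation read off the Chinchilla results and not an exact constant from the paper, together with the widely used estimate C ≈ 6·N·D for training FLOPs. The function names are hypothetical, and the 1e12 token count is the round "trillion tokens" figure from the comment above.

```python
# Assumed rule of thumb: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: float,
                              tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute via the common approximation C ~ 6 * N * D."""
    return 6 * n_params * n_tokens


n_params = 7e9          # 7B-parameter model
actual_tokens = 1e12    # "trillion tokens", as stated in the comment

optimal_tokens = chinchilla_optimal_tokens(n_params)
overtrain_ratio = actual_tokens / optimal_tokens

print(f"compute-optimal tokens: {optimal_tokens:.3g}")
print(f"actual / optimal:       {overtrain_ratio:.1f}x")
print(f"training FLOPs at optimal: {training_flops(n_params, optimal_tokens):.3g}")
print(f"training FLOPs as trained: {training_flops(n_params, actual_tokens):.3g}")
```

On these assumptions the 7B model sees about seven times more tokens than the compute-optimal amount. The extra training compute is a one-time cost, while the smaller parameter count lowers the cost of every later inference call.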
