Umm... so does OpenAI. In fact, this is OpenAI's discovery from [1]:
>Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
>We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and were superseded by DeepMind's Chinchilla paper: https://arxiv.org/abs/2203.15556
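To put rough numbers on how different the two prescriptions are, here is a minimal sketch. The 0.27 exponent comes straight from the quote from [1]; the 0.5 exponent is my assumption, a round figure for the Chinchilla result that data should scale roughly in proportion to model size, so check the paper for the exact fitted values. The function and constant names are made up for illustration.

```python
# Exponent quoted from the older OpenAI scaling laws paper [1]: D ~ C^0.27.
KAPLAN_DATA_EXPONENT = 0.27
# ASSUMPTION: round figure for the Chinchilla paper's data exponent (D ~ C^0.5).
CHINCHILLA_DATA_EXPONENT = 0.5


def data_growth(compute_multiplier: float, exponent: float) -> float:
    """Factor by which compute-optimal data D grows if D ~ C**exponent."""
    return compute_multiplier ** exponent


if __name__ == "__main__":
    for mult in (10, 100, 1000):
        old = data_growth(mult, KAPLAN_DATA_EXPONENT)
        new = data_growth(mult, CHINCHILLA_DATA_EXPONENT)
        print(f"{mult:>5}x compute: {old:6.2f}x data (0.27) vs {new:6.2f}x data (0.5)")
```

Under these assumptions, 10x more compute calls for roughly 1.9x more data with the old exponent versus roughly 3.2x with the newer one, and the gap widens as compute grows.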
P.S. Not trolling. Genuinely trying to learn.
[1] https://arxiv.org/abs/2001.08361