by z4y5f3
688 days ago
What they missed is that current scaling laws (OpenAI, DeepMind Chinchilla) assume the model is trained for a single epoch. That means that to scale compute, you have to scale the model size and/or the dataset size. So Meta cannot simply spend 3.8e25 FLOPs on a 70B model: to do that they would need 86T pretraining tokens, which they do not have. Of course, we will eventually figure out scaling laws for LLMs trained on multiple epochs of data, but not today.
https://arxiv.org/abs/2305.16264
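
For anyone who wants to sanity-check the token figure, here is a rough sketch using the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens). That approximation and the helper name `tokens_for_budget` are my assumptions, not something stated in the comment above; the FLOPs budget and the 70B model size are taken from it.

```python
def tokens_for_budget(flops: float, params: float) -> float:
    """Tokens D implied by the assumed rule of thumb C ~= 6 * N * D."""
    return flops / (6.0 * params)


budget_flops = 3.8e25  # compute budget quoted in the comment
model_params = 70e9    # 70B parameters

tokens = tokens_for_budget(budget_flops, model_params)
print(f"{tokens / 1e12:.0f}T tokens")  # prints "90T tokens"
```

This lands at roughly 90T tokens, the same ballpark as the 86T quoted above. The gap presumably comes from a slightly different FLOPs-per-token constant, but that is a guess.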