Hacker News
by z4y5f3 688 days ago
What they missed is that the current scaling laws (OpenAI, DeepMind Chinchilla) are based on the assumption that the model is trained for one epoch. This essentially means that in order to scale compute, you have to scale the model size and/or the size of the dataset. So Meta cannot simply spend 3.8e25 FLOPs on a 70B model - to do this they would need 86T pretraining tokens, which they do not have.

Of course, ultimately we will figure out scaling laws for LLMs trained on multiple epochs of data, but not today.
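
For anyone who wants to check the order of magnitude of that token count, here is a rough sketch. It assumes the common rule of thumb that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D); that constant and the function name `tokens_for_budget` are my assumptions, not something stated above. With this constant the result comes out around 90T, the same ballpark as the 86T figure, which presumably uses a slightly different constant.

```python
def tokens_for_budget(compute_flops: float, n_params: float,
                      flops_per_param_token: float = 6.0) -> float:
    """Tokens D needed to spend a compute budget C on N parameters in one epoch.

    Assumes the rule of thumb C ~= 6 * N * D (an approximation, not an exact law).
    """
    return compute_flops / (flops_per_param_token * n_params)


# Figures from the comment: 3.8e25 FLOPs spent on a 70B-parameter model.
tokens = tokens_for_budget(3.8e25, 70e9)
print(f"{tokens / 1e12:.1f}T tokens")
```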

2 comments

There is some good published research about doing multiple passes over the training data, and how quickly learning saturates. The TL;DR is that diminishing returns kick in after about 4 epochs.

https://arxiv.org/abs/2305.16264
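
To make "diminishing returns" concrete, here is a small sketch of a saturating-exponential model for the value of repeated data, of the kind I understand the linked paper to fit. The exact functional form as written here, the function name `effective_tokens`, and the constant `r_star = 15.0` are my assumptions from memory; check the paper before relying on them.

```python
import math


def effective_tokens(unique_tokens: float, epochs: float,
                     r_star: float = 15.0) -> float:
    """Unique-token equivalent of training `epochs` passes over `unique_tokens`.

    Assumed form: D' = U * (1 + R* * (1 - exp(-R / R*))), where R = epochs - 1
    is the number of repeats. r_star (R*) is an assumed constant, not a value
    taken from this thread.
    """
    repeats = epochs - 1
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


for epochs in (1, 2, 4, 8, 16, 40):
    worth = effective_tokens(1.0, epochs)
    print(f"{epochs:>2} epochs ~ {worth:.2f} epochs' worth of unique data")
```

Under these assumptions, 4 epochs are worth nearly 4 epochs of fresh data, while each additional pass past that point adds progressively less.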

Yep, I have seen this paper before, and thank you for linking it here for reference. My personal opinion is that, compared to single-epoch scaling laws, we still need more evidence and literature on the effects of multiple epochs, but this paper is one of the best results we have so far on using multiple epochs.
But inside one epoch there is already a lot of duplication.

By duplication I mean that if the context length is N, there are many sequences of N words that are not unique.
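
A toy way to measure that kind of duplication: slide a window of N words over the text and count how many windows occur more than once. This is only a sketch; the function name `repeated_window_fraction`, the whitespace tokenization, and the example sentence are all made up for illustration, and a real corpus would need a proper tokenizer and hashing to scale.

```python
from collections import Counter


def repeated_window_fraction(tokens: list[str], n: int) -> float:
    """Fraction of length-n windows whose contents appear more than once."""
    windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not windows:
        return 0.0
    counts = Counter(windows)
    # Count every occurrence of a window that shows up at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(windows)


# Hypothetical example text, split on whitespace.
words = "the cat sat on the mat and the cat sat on the rug".split()
frac = repeated_window_fraction(words, n=3)
print(f"{frac:.0%} of 3-word windows are not unique")
```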