LLMs would always bottleneck on one of those two, as computing demand grows crazy quickly with the data amount, and data is necessarily limited. Turns out people threw crazy amounts of compute into it, so the we got the other limit.
There is plenty of data left, we don’t just train with crawled text data. Power constraints may turn out to be the real bottleneck but we’re like 4 orders of magnitude away
There's a limit to that according to: https://www.nature.com/articles/s41586-024-07566-y . Basically, if you use an LLM to augment a training dataset it will become "dumber" every subsequent generation and I am not sure how you can generate synthetic data for a language model without using a language model
Synthetic data doesn't have to come from an LLM. And that paper only showed that if you train on a random sample from an LLM, the resulting second LLM is a worse model of the distribution that the first LLM was trained on. When people construct synthetic data with LLMs, they typically do not just sample at random, but carefully shape the generation process to match the target task better than the original training distribution.
LLMs would always bottleneck on one of those two, as computing demand grows crazy quickly with the data amount, and data is necessarily limited. Turns out people threw crazy amounts of compute into it, so the we got the other limit.