|
|
|
|
|
by ttfvjktesd
264 days ago
|
|
I think one important point is missing here: more data does not automatically lead to better LLMs. If you increase the amount of data tenfold, you might only achieve a slight improvement. We already see that simply adding more and more parameters for instance does not currently make models better. Instead, progress is coming from techniques like reasoning, grounding, post-training, and reinforcement learning, which are the main focus of improvement for state-of-the-art models in 2025. |
|
If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TBs of data scraped from the internet to 200TBs of data scraped from the internet... does it tell you much more? Unclear.
But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.