

	by ttfvjktesd 311 days ago
	I think one important point is missing here: more data does not automatically lead to better LLMs. If you increase the amount of data tenfold, you might only achieve a slight improvement. We already see that simply adding more and more parameters, for instance, does not currently make models better. Instead, progress is coming from techniques like reasoning, grounding, post-training, and reinforcement learning, which are the main focus of improvement for state-of-the-art models in 2025.

1 comments

williamtrask 311 days ago

(OP) the scaling laws / bitter lesson would disagree, but I tend to agree with you with some hedging.

If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear.

But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.

joe_the_user 311 days ago

> the scaling laws / bitter lesson would disagree

I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.

williamtrask 311 days ago

Agree with you on the nuance.

CuriouslyC 311 days ago

More data isn't automatically better. You're trying to build the most accurate model possible of the "true" latent space (estimated from user preference/computational oracles). More data can give you more coverage of the latent space, it can smooth out your estimate of it, and it can let you bake more knowledge in (TBH this is low value though; freshness is a problem). If you add more data that isn't covering a new part of the latent space, the value quickly goes to zero as your redundancy increases. Also, you have to be careful when you add data that you aren't giving the model ineffective biases.
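
To make the redundancy point concrete, here is a toy sketch (not how anyone actually measures this). It assumes the "latent space" is a 2D unit square cut into grid cells, and that the value of new data is the number of previously unseen cells it touches. The names `coverage` and `marginal_gain`, the cell size, and the sample points are all made up for illustration.

```python
import math

def coverage(points, cell=0.1):
    """Return the set of grid cells hit by the points (a crude stand-in for latent-space coverage)."""
    return {(math.floor(x / cell), math.floor(y / cell)) for x, y in points}

def marginal_gain(existing, new, cell=0.1):
    """Count how many previously uncovered cells the new points add."""
    before = coverage(existing, cell)
    after = before | coverage(new, cell)
    return len(after) - len(before)

# Hypothetical data: three existing samples, all in one corner of the space.
existing = [(0.05, 0.05), (0.15, 0.05), (0.25, 0.05)]

# Exact copies of what we already have: no new cells covered.
redundant = list(existing)

# Samples from a region we have never seen: every one adds coverage.
novel = [(0.85, 0.85), (0.95, 0.95)]

print(marginal_gain(existing, redundant))
print(marginal_gain(existing, novel))
```

In this toy setup the duplicated samples add nothing no matter how many you pile on, while the two samples from an unseen region each add a cell, even though the redundant batch is the bigger one.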
