	by jeremyjh 287 days ago
	I think we're talking past each other, so I'll try once more. Suppose you train an LLM on a very small corpus of data, such as all the content of the Library of Congress. Then you have that LLM author new works. Then you train a new LLM on the original corpus plus this new material. Do you really think you've addressed the core issue in the SP? Can more parameters be meaningfully trained even if you add more GPUs? To me, the answer is clearly no. There is no new information content in the generated data. It's just a remix of what already exists.

2 comments

hackinthebochs 287 days ago

When it comes to logical reasoning, the difficulty isn't about having enough new information, but about ensuring the LLMs capture the right information. The problem LLMs have with learning logical reasoning from standard training is that they learn spurious relationships between the context and the next token, which undermines their ability to learn fully general logical reasoning. Synthetic data helps because the randomness inherent in it breaks those spurious associations, forcing the model to find the right generic reasoning steps.
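To make the point concrete, here is a minimal sketch of that kind of synthetic data, assuming a toy task of implication chains over random symbols. The function names and the prompt wording are hypothetical, chosen only for illustration. Because the symbols are meaningless and the rule order is shuffled, the only feature that predicts the answer is whether the chain of rules is complete.

```python
import random
import string


def random_symbol(rng: random.Random, length: int = 5) -> str:
    """Return a meaningless lowercase token so surface form carries no hint."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_chain_example(rng: random.Random, chain_length: int = 3) -> dict:
    """Build one implication-chain problem over random symbols (toy assumption)."""
    # Draw chain_length + 1 distinct symbols: s0 -> s1 -> ... -> sN.
    symbols: list[str] = []
    while len(symbols) < chain_length + 1:
        candidate = random_symbol(rng)
        if candidate not in symbols:
            symbols.append(candidate)

    rules = [f"If {a} then {b}." for a, b in zip(symbols, symbols[1:])]

    # Half the time, drop one rule so the conclusion is no longer derivable.
    derivable = rng.random() < 0.5
    if not derivable:
        del rules[rng.randrange(len(rules))]

    # Shuffle so rule order is not a shortcut either.
    rng.shuffle(rules)

    prompt = " ".join(rules) + f" {symbols[0]} holds. Can {symbols[-1]} be derived?"
    return {"prompt": prompt, "answer": "yes" if derivable else "no"}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
examples = [make_chain_example(rng) for _ in range(4)]
```

Each call produces fresh symbols, so a model cannot memorize token-level associations across examples; whether this transfers to natural-language reasoning is a separate question the sketch does not address.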

jeremyjh 286 days ago

I agree! DeepSeek has shown this is incredibly powerful. I think their Qwen 8B model may be as good as the flagship GPT-4. And I can run it on my laptop if it’s not on my lap. But the amount of synthetic data you can generate is bounded by the raw information, so I don’t think it’s an answer to the SP.

voxic11 287 days ago

Yes, if you have some way to verify the quality of the new works and you only include the high-quality works in the new LLM's training set.
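A minimal sketch of that filtering step, assuming a domain where a cheap programmatic checker exists (here, toy addition problems). The sample format, `filter_verified`, and `arithmetic_verifier` are all hypothetical names for illustration; for open-ended text no such checker may be available.

```python
from typing import Callable


def filter_verified(candidates: list[dict], verifier: Callable[[dict], bool]) -> list[dict]:
    """Keep only the generated samples that the verifier accepts."""
    return [sample for sample in candidates if verifier(sample)]


def arithmetic_verifier(sample: dict) -> bool:
    """Check a claimed sum by recomputing it independently of the generator."""
    a, b = sample["operands"]
    return sample["claimed_sum"] == a + b


# Stand-in for model-generated samples (made up for this sketch).
generated = [
    {"operands": (2, 3), "claimed_sum": 5},
    {"operands": (10, 7), "claimed_sum": 18},  # incorrect, should be rejected
    {"operands": (4, 4), "claimed_sum": 8},
]

kept = filter_verified(generated, arithmetic_verifier)
```

Only the samples in `kept` would go into the next training set. The verifier here does not depend on the model that produced the samples, which is the property the filtering relies on.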

jeremyjh 286 days ago

But you don't have a way to do that at scale, other than feeding it to another LLM that is trained on that exact same limited corpus. There is no new information being added into the system in loops like that. New information means new measurements, new proofs, new signal or media streams from cameras, new curation/rating data, new books or papers, etc.
