Hacker News
by jeremyjh 287 days ago
AlphaZero trained itself through chess games that it played against itself. Chess positions have something very close to an objective truth about the evaluation: the rules are clear and bounded, and winning is measurable. How do you achieve this for a language model?

Yes, distillation is a thing but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.

3 comments

You can have a look at the DeepSeek paper, in particular section "2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model".

But generally the idea is that you need some notion of reward, verifiers, etc.

Works really well for maths, algorithms, and many things actually.
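
To make "some notion of reward, verifiers" concrete, here is a minimal sketch of a rule-based reward for maths answers. It assumes the model is prompted to put its final answer in `\boxed{...}` and that answers are plain rationals; the function names `extract_boxed` and `reward` are hypothetical and not taken from any paper or library.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_boxed(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, or None."""
    # Assumption: the answer contains no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def reward(completion: str, reference: str) -> float:
    """1.0 if the boxed answer equals the reference as an exact rational, else 0.0."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0  # no parseable final answer, no reward
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # answer is not a number we can check
```

The point of the sketch is that the check is cheap and does not depend on the model's own judgment, so the reward cannot drift along with the model.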

See also this very short essay/introduction: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

That's why we have IMO gold level models now, and I'm pretty confident we'll have superhuman mathematics, algorithmic etc models before long.

Now domains which are very hard to verify - think e.g. theoretical physics etc - that's another story.

> But generally the idea is that you need some notion of reward, verifiers, etc.

I don't think you're getting the point he's making.

Synthetic data is already widely used for training in the programming and mathematics domains, where automated verification is possible. Here is an example of an open source, verified reasoning synthetic dataset: https://www.primeintellect.ai/blog/synthetic-1
Are they actually producing new data though? This is the sort of thing I called "compression and filtering", because it seems that new information content is not being produced; LLMs are being used to distill the information we already have. We need more raw information.
Yes, this is new synthetic data which did not exist before. I encourage you to read the link.
I think we're talking past each other, I'll try once more. Suppose you train an LLM on a very small corpus of data, such as all the content of the Library of Congress. Then you have that LLM author new works. Then you train a new LLM on the original corpus plus this new material. Do you really think you've addressed the core issue in the SP? Can more parameters be meaningfully trained even if you add more GPUs?

To me, the answer is clearly no. There is no new information content in the generated data. It's just a remix of what already exists.

When it comes to logical reasoning, the difficulty isn't about having enough new information, but about ensuring the LLMs capture the right information. The problem LLMs have with learning logical reasoning from standard training is that they learn spurious relationships between the context and the next token, undermining their ability to learn fully general logical reasoning. Synthetic data helps because spurious associations are undermined by the randomness inherent in the synthetic data, forcing the model to find the right generic reasoning steps.
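
One way to picture the randomness argument above is a templated generator in which the surface details (names, items, numbers) are drawn independently of the reasoning step, so only the addition itself predicts the answer. This is a toy sketch; the template, the word lists and the function names are all made up for illustration.

```python
import random

# Hypothetical surface vocabulary; none of it correlates with the answer.
NAMES = ["Ada", "Bo", "Chen", "Dara"]
ITEMS = ["apples", "bolts", "cards", "stamps"]


def make_example(rng: random.Random) -> dict:
    """Build one word problem whose answer depends only on a + b."""
    name, item = rng.choice(NAMES), rng.choice(ITEMS)
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = (
        f"{name} has {a} {item} and buys {b} more. "
        f"How many {item} does {name} have now?"
    )
    return {"question": question, "a": a, "b": b, "answer": a + b}


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n examples reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]
```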
I agree! DeepSeek has shown this is incredibly powerful. I think their Qwen 8B model may be as good as GPT4’s flagship. And I can run it on my laptop if it’s not on my lap. But the amount of synthetic data you can generate is bounded by the raw information, so I don’t think it’s an answer to the SP.
Yes, if you have some way to verify the quality of the new works and you only include the high quality works in the new LLM's training set.
But you don't have a way to do that at scale, other than feeding it to another LLM trained on that exact same limited corpus. There is no new information being added into the system in loops like that. New information means new measurements, new proofs, new signal or media streams from cameras, new curation/rating data, new books or papers etc.
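
As a rough sketch of "generate, verify, keep only what passes", here is a toy loop where the checker is external to the generator. Integer factoring stands in for the task because a guess is hard to produce but trivial to check; `noisy_factor_guess` is a hypothetical stand-in for a model, not a real one.

```python
import random
from typing import List, Tuple


def noisy_factor_guess(n: int, rng: random.Random) -> Tuple[int, int]:
    """Stand-in for a model: propose a factor pair for n, usually a wrong one."""
    a = rng.randint(2, n - 1)
    return a, n // a


def verify(n: int, pair: Tuple[int, int]) -> bool:
    """Cheap external check: a nontrivial pair that multiplies back to n."""
    a, b = pair
    return a > 1 and b > 1 and a * b == n


def build_training_set(numbers: List[int], samples_per_item: int = 200,
                       seed: int = 0) -> List[Tuple[int, Tuple[int, int]]]:
    """Sample guesses per number and keep the first one that verifies, if any."""
    rng = random.Random(seed)
    kept = []
    for n in numbers:
        for _ in range(samples_per_item):
            guess = noisy_factor_guess(n, rng)
            if verify(n, guess):
                kept.append((n, guess))
                break  # one verified sample per number is enough here
    return kept
```

Whatever ends up in `kept` is correct by construction, and nothing is kept for a prime such as 97. The sketch says nothing about domains where no such cheap checker exists.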
Simple, you just need to turn language into a game.

You make models talk to each other, create puzzles for each other to solve, ask each other to make cases and evaluate how well they were made.

Will some of it look like ramblings of pre-scientific philosophers? (or modern ones because philosophy never progressed after science left it in the dust)

Sure! But human culture was once there too. And we pulled ourselves out of this nonsense by the bootstraps. We didn't need to be exposed to 3 alien internets with higher truth.

It's really a miracle that AIs got as much as they did from the purely human-generated, mostly garbage text we cared to write down.

I feel like you’re glossing over some very thorny details that it’s not obvious we can solve. For example, if you just get two LLMs setting each other puzzles and scoring the other’s solutions, how do you stop this just collapsing into nonsense? I.e. where does the source of actual truth come from for the puzzles?
> I feel like you’re glossing over some very thorny details that it’s not obvious we can solve.

Yeah. I tried to be funny. It's not that easy. However, AI people have already started doing it, and the AI gains of the last year perhaps come mostly from this approach.

> For example, if you just get two LLMs setting each other puzzles and scoring the other’s solutions, how do you stop this just collapsing into nonsense?

That's the trillion dollar question. I wonder how people are doing it. Maybe through the economy? You ultimately need to sell your ramblings to somebody to sustain yourself. If you can't, you starve.

Maybe that's enough for AI as well? Companies with AIs that descended into nonsense won't have any more money to train them further. Maybe companies will need to set up their internal ecosystems of competing AI training organizations and split the budget based on how useful they are becoming?

Phrasing this in terms of "truth" is probably counterproductive because there's no truth. There's only what sells. If you have customers in manufacturing, the things that sell will probably coincide with some physical truths, but this is emergent, not the goal or even part of the process of acquiring capabilities.

> And we pulled ourselves out of this nonsense by the bootstraps.

Human progress was promoted by having to interact with a physical world that anchored our ramblings and gave us a reward function for coherence and cooperation. LLMs would need some analogous anchoring to progress beyond incoherent babble.

True, but LLMs got anchored to reality because we are using them in real-world tasks, and this connection will only grow richer, wider and faster.