by jeremyjh
287 days ago
AlphaZero trained itself by playing chess games against itself. Chess positions have something very close to an objectively true evaluation, and the rules are clear and bounded. Winning is measurable. How do you achieve this for a language model? Yes, distillation is a thing, but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.
|
But generally the idea is that you need some notion of reward, verifiers, etc.
That works really well for maths, algorithms, and many other things, actually.
See also this very short essay/introduction: https://www.jasonwei.net/blog/asymmetry-of-verification-and-...
That's why we have IMO-gold-level models now, and I'm pretty confident we'll have superhuman models for mathematics, algorithms, etc. before long.
Now, domains that are very hard to verify (think theoretical physics, for example) are another story.
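
To make the reward/verifier idea concrete, here is a minimal toy sketch in Python, not how any real lab trains models. It assumes a task (factoring an integer) where checking an answer is far cheaper than finding one. All names (`verify_factorization`, `reward`, `propose`, `collect_verified`) are made up for illustration, and `propose` is just a random guesser standing in for sampling from a model.

```python
import random


def verify_factorization(n: int, factors: list[int]) -> bool:
    """Cheap check: at least two factors, all > 1, whose product is n."""
    if len(factors) < 2 or any(f <= 1 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f
    return product == n


def reward(n: int, factors: list[int]) -> float:
    """Binary reward derived purely from the verifier."""
    return 1.0 if verify_factorization(n, factors) else 0.0


def propose(n: int, rng: random.Random) -> list[int]:
    """Hypothetical stand-in for a model sample: a random guess at two factors."""
    a = rng.randint(2, n - 1)
    return [a, n // a]


def collect_verified(n: int, num_samples: int, seed: int = 0) -> list[list[int]]:
    """Sample candidates and keep only those the verifier accepts.

    The kept candidates play the role of new training data, loosely
    analogous to the new positions produced by self-play games.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        candidate = propose(n, rng)
        if reward(n, candidate) == 1.0:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    print(collect_verified(91, num_samples=500))
```

The point is only the shape of the loop: generation can be arbitrarily bad, and the verifier filters it into data you can trust. For something like theoretical physics there is no cheap `verify_*` function to put in the middle, which is why that case is harder.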