| HN Mirror



	by quantumspandex 534 days ago
	Andrej's video is great, but the explanation of the RL part is a bit vague to me. How exactly do we train on the right answers? Do we collect the reasoning traces and train on them as in supervised learning, or do we compute some scores and use them as a loss function? Isn't the reward then very sparse? What if the LLM can't generate any right answers because the problems are too hard? Also, how can LLM training be parallelized when parameter updates are sequential? Sure, we can train on several samples simultaneously, but then each sample's update is computed with respect to the same starting parameters.

3 comments

fenomas 534 days ago

As I understood that part, in RL for LLMs you take questions for which the model already sometimes emits correct answers, and then repeatedly infer while reinforcing the activations the model made during correct responses, which lets it evolve its own ways of more reliably reaching the right answer.

(Hence the analogy to training AlphaGo, wherein you take a model that sometimes wins games, and then play a bunch of games while reinforcing the cases where it won, so that it evolves its own ways of winning more often.)
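
The loop described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not how any production LLM is trained: the "model" is just a softmax over four candidate answers, and `softmax`, `reinforce_step`, `correct_idx`, the learning rate, and the sample count are hypothetical names and values chosen for the example. It uses a plain REINFORCE-style update with reward 1 for a correct sample and 0 otherwise.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, correct_idx, rng, lr=0.5, samples=8):
    """One policy-gradient update from a batch of sampled answers.

    All samples are drawn from the same (current) parameters, so they
    could be generated and scored in parallel before the single update.
    """
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(samples):
        answer = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if answer == correct_idx else 0.0
        # Gradient of log pi(answer) w.r.t. the logits is onehot(answer) - probs.
        for i in range(len(logits)):
            onehot = 1.0 if i == answer else 0.0
            grads[i] += reward * (onehot - probs[i])
    # Average the per-sample gradients and take one ascent step.
    return [l + lr * g / samples for l, g in zip(logits, grads)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0, 0.0]  # the toy model starts out guessing uniformly
    for _ in range(200):
        logits = reinforce_step(logits, correct_idx=2, rng=rng)
    print(softmax(logits))  # probability mass should have shifted toward index 2
```

Two of the original questions show up directly in this sketch. Every sample in a step is scored against the same parameters and the gradients are averaged into one update, which is what makes the batch parallelizable. And if the correct answer is never sampled, every reward is zero and the logits do not move, which is why the questions need to be ones the model already gets right some of the time.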


quantumspandex 534 days ago

AlphaGo seems more like an automated process to me because you can start from nothing except the algorithm and the rules. Since a Go game has only two outcomes most of the time, and the model can play against itself, it is guaranteed to learn something during self-play.

In the LLM case you have to have an already capable model to do RL. Also I feel like the problem selection part is important to make sure it's not too hard. So there's still much labor involved.


fenomas 534 days ago

Yes, IIUC those points are correct - you need relatively capable models, and well-crafted questions. The comparison with AlphaGo is that the processes are analogous, not identical - the key point being that in both cases the model is choosing its own path towards a goal, not just imitating the path that a human labeler took.


mtkd 534 days ago

Details on how DeepSeek (DS) used GRPO for RL rewards:

https://medium.com/@sahin.samia/the-math-behind-deepseek-a-d...
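
For readers who want the core idea before following the link: GRPO samples a group of answers to the same question and scores each one relative to the rest of its group, rather than training a separate value model. A minimal sketch of that normalization step is below. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, and the rest of the objective (clipping, KL penalty) is left out.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std.

    `rewards` holds one scalar reward per sampled answer to the same prompt.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A question the model never gets right gives no signal at all.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

Correct answers in a mixed group get a positive advantage and wrong ones a negative advantage. When every answer in the group earns the same reward, all advantages are zero, which ties back to the sparse-reward concern in the original question.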


quantumspandex 534 days ago

Thanks!


epr 534 days ago

https://arxiv.org/abs/1707.06347
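
That link is the PPO paper ("Proximal Policy Optimization Algorithms"). Its central piece is a clipped surrogate objective, which limits how far a single update can push the policy away from the one that generated the samples. A per-sample sketch is below. The function name `clipped_surrogate` and the default `clip_eps=0.2` are choices made for this example, and the value-function and entropy terms used in full PPO are omitted.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped objective (to be maximized).

    ratio     = pi_new(action) / pi_old(action) for the sampled action
    advantage = how much better than expected the sampled action turned out
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound, so there is
    # no extra gain from moving the ratio outside [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(clipped_surrogate(ratio=1.5, advantage=1.0))   # gain is capped by the clip
    print(clipped_surrogate(ratio=0.5, advantage=-1.0))  # clip applies on the low side too
```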


quantumspandex 534 days ago

Will have a look. Thanks!
