| HN Mirror



	by prideout 685 days ago
	Reinforcement learning seems to be key. I understand how traditional fine-tuning works for LLMs (i.e. RLHF), but not RL. It seems one popular method is PPO, but I don't understand at all how to implement that. E.g. is backpropagation still used to adjust weights and biases? Would love to read more from something less opaque than an academic paper.


janalsncm 685 days ago

The point of RL is that sometimes you need a model to take actions (you could also call this making predictions) that don’t have a known label. So for example if it’s playing a game, we don’t have a label for each button press. We just have a label for the result at some later time, like whether Pac-Man beat the level.

PPO applies this logic to chat responses. If you have a model that can tell you whether a response was good, you can treat each token the model generated as an action and use that score for the whole series of actions to learn how to generate good responses.

To answer your question, yes you would still use backprop if your model is a neural net.
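To make that concrete, here is a minimal sketch of the simplest policy-gradient update (REINFORCE, not PPO itself, which adds clipping and a value model on top of the same idea). The three-token "vocabulary", the reward function, and the learning rate are all made up for illustration. The loss being differentiated is the reward times the log-probability of the sampled token, so the gradient pushes up tokens that scored well rather than pushing towards a known label.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_fn(token):
    # Hypothetical reward model: only token 2 counts as a "good response".
    return 1.0 if token == 2 else 0.0

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # the "weights" of a toy one-step policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy; there is no label for it.
        token = rng.choices(range(3), weights=probs)[0]
        r = reward_fn(token)
        # d/d(logits) of log pi(token) is onehot(token) - probs.
        # In a neural net, backprop computes this same gradient through every layer.
        for i in range(3):
            grad_logp = (1.0 if i == token else 0.0) - probs[i]
            logits[i] += lr * r * grad_logp  # gradient ascent on r * log pi(token)
    return softmax(logits)

final_probs = train()
print(final_probs)
```

The policy ends up putting most of its probability on the rewarded token even though it was never told which token was "correct". Only the score was used.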


prideout 684 days ago

Thanks, that helps! I still don't quite understand the mechanics of this, since backprop makes adjustments to steer the LLM towards a specific token sequence, not towards a score produced by a reward function.


vjerancrnjak 684 days ago

Any RL task needs to decompose the loss, i.e. figure out how much each individual action contributed to the final outcome.

This was also the issue with RLHF models. The loss of predicting the next token is straightforward to minimize, as we know which weights are responsible for the token being correct or not. Identifying which tokens made the most sense for a prompt is not straightforward.

For thinking, you might generate 32k thinking tokens and then 96k solution tokens, and do this many times. Then look at the solutions, rank them by quality, and bias the model towards better thinking by adjusting the weights for the first 32k tokens. But I’m sure o1 is way past this approach.
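A rough sketch of the "rank and bias" step, under my own assumptions: score each sampled solution, turn the scores into advantages relative to the group average, and use each advantage as the weight on the log-probabilities of that sample's thinking tokens. The function names, scores, and log-probs below are invented for illustration, and this is not a claim about how o1 is trained.

```python
import statistics

def group_advantages(scores):
    """Normalize scores within one group of samples for the same prompt."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        # All samples scored the same, so there is nothing to prefer.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]

def surrogate_loss(advantages, thinking_logprobs):
    """Loss whose gradient raises the log-prob of above-average thinking
    and lowers it for below-average thinking."""
    total = 0.0
    for adv, logprobs in zip(advantages, thinking_logprobs):
        total += adv * sum(logprobs)
    return -total / len(advantages)

# Hypothetical quality scores for four sampled solutions to one prompt.
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_advantages(scores)

# Hypothetical per-token log-probs of each sample's thinking tokens
# (3 tokens each here instead of 32k).
thinking_logprobs = [
    [-1.2, -0.7, -2.0],
    [-0.9, -1.1, -0.4],
    [-1.5, -0.8, -1.0],
    [-0.6, -1.9, -1.3],
]
loss = surrogate_loss(advantages, thinking_logprobs)
print(advantages, loss)
```

In a real setup the log-probs would come from the model and the loss would be backpropagated through it. Here they are plain numbers just to show how the ranking becomes a per-sample weight.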
