by prideout
638 days ago
Reinforcement learning seems to be key. I understand how traditional fine-tuning works for LLMs (i.e. RLHF), but not RL. It seems one popular method is PPO, but I don't understand at all how to implement that. For example, is backpropagation still used to adjust weights and biases? Would love to read more from something less opaque than an academic paper.

PPO applies this logic to chat responses. If you have a model that can tell you whether a response was good, you just need to treat the response as a series of actions (each token the model generated) and use that score to learn how to generate good responses.
To answer your question: yes, you would still use backprop if your model is a neural net.
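
To make that concrete, here is a rough sketch of PPO's clipped update on a toy problem, not a real LLM setup. Everything in it is an assumption for illustration: the "policy" is just three logits (one per token, with no context), `reward_model` is a hypothetical stand-in for a learned reward model, and the gradient is written out by hand. With a real neural net you would compute the same loss and let an autodiff framework backpropagate it through every layer.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward_model(token):
    # Hypothetical stand-in for a learned reward model: token 2 is "good".
    return 1.0 if token == 2 else 0.0

def ppo_token_grad(logits, old_prob, token, advantage, eps=0.2):
    """Gradient of the PPO clipped loss w.r.t. the logits for one sampled token.

    loss = -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
    where ratio = pi_new(token) / pi_old(token).
    """
    probs = softmax(logits)
    ratio = probs[token] / old_prob
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # If the clipped term is the smaller one, the loss is constant in the
    # logits, so no gradient flows for this sample.
    if clipped * advantage < ratio * advantage:
        return [0.0] * len(logits)
    # Chain rule (what backprop would compute automatically):
    # d(ratio)/d(logit_j) = ratio * (1[j == token] - probs[j])
    return [
        -advantage * ratio * ((1.0 if j == token else 0.0) - p)
        for j, p in enumerate(probs)
    ]

def train(steps=200, batch=16, epochs=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # 1. Sample actions (tokens) from the current policy and freeze its
        #    probabilities as the "old" policy.
        old_probs = softmax(logits)
        tokens = rng.choices(range(len(logits)), weights=old_probs, k=batch)
        # 2. Score them, and use the batch mean as a simple baseline.
        rewards = [reward_model(t) for t in tokens]
        baseline = sum(rewards) / batch
        # 3. Take several gradient steps on the same batch; the clipping
        #    keeps the new policy from drifting too far from the old one.
        for _ in range(epochs):
            grad = [0.0] * len(logits)
            for t, r in zip(tokens, rewards):
                g = ppo_token_grad(logits, old_probs[t], t, r - baseline)
                grad = [a + b / batch for a, b in zip(grad, g)]
            logits = [w - lr * g for w, g in zip(logits, grad)]
    return softmax(logits)

if __name__ == "__main__":
    print(train())
```

The parts that carry over to the real thing are the probability ratio, the clipping, and the reuse of one sampled batch for several gradient steps. Real PPO implementations typically also train a value network for the advantage estimate and add a KL penalty against the original model, both of which are left out here.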