by janalsncm
638 days ago
The point of RL is that sometimes you need a model to take actions (you could also call this making predictions) that don’t have a known label. For example, if it’s playing a game, we don’t have a label for each button press. We only have a label for the result at some later time, like whether Pac-Man beat the level. PPO applies the same logic to chat responses: if you have a model that can tell you whether a response was good, you can treat each generated token as an action and use that score to learn how to generate good responses. To answer your question: yes, you would still use backprop if your model is a neural net.
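To make the "no label per action" idea concrete, here is a rough sketch. It is plain REINFORCE (vanilla policy gradient), not PPO; PPO adds a clipped probability-ratio objective on top of the same basic idea. Everything here is made up for illustration: the two-action policy, the `run_episode` reward rule (1.0 if most actions were action 1), and the running-average baseline. The gradient is written out by hand for a softmax over two logits; with a neural net, backprop would compute the same kind of gradient for you.

```python
import math
import random


def softmax(logits):
    """Turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(logits, length, rng):
    """Sample `length` actions; the only label is one reward at the end."""
    probs = softmax(logits)
    actions = [rng.choices(range(len(logits)), weights=probs)[0]
               for _ in range(length)]
    # Hypothetical delayed reward: 1.0 if most actions were action 1.
    reward = 1.0 if sum(1 for a in actions if a == 1) * 2 > length else 0.0
    return actions, reward


def reinforce_update(logits, actions, reward, baseline, lr):
    """One REINFORCE step: every action in the episode shares the end reward."""
    probs = softmax(logits)
    advantage = reward - baseline
    new_logits = list(logits)
    for a in actions:
        for k in range(len(logits)):
            # d log pi(a) / d logit_k for a softmax policy.
            grad_logp = (1.0 if k == a else 0.0) - probs[k]
            new_logits[k] += lr * advantage * grad_logp
    return new_logits


def train(steps=2000, length=3, lr=0.1, seed=0):
    """Train the toy policy and return its final action probabilities."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        actions, reward = run_episode(logits, length, rng)
        logits = reinforce_update(logits, actions, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Note that no individual action is ever told whether it was right. The whole sequence gets nudged up or down based on the single score at the end, which is the same shape as scoring a finished chat response and crediting each token.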