by bradfox2
1056 days ago
The LLM's output is scored by a separate reward model, which produces a single scalar reward for the entire sequence the LLM emits. The reward model is usually trained on human preferences or some other metric. It's RL because the LLM is trained to increase this reward score (or, equivalently, to minimize its negative) rather than to optimize a language modeling objective.
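To make the "reward for the whole sequence" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_reward` is a hand-written stand-in for a learned reward model, the "policy" is just one shared set of logits over a three-token vocabulary, and the update is plain REINFORCE with a running-average baseline, not whatever algorithm a given production system uses.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary
SEQ_LEN = 4              # every sampled sequence has this many tokens


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(tokens):
    """Stand-in for a reward model: one scalar for the whole sequence.

    Here the 'preference' is simply the fraction of tokens equal to "a".
    """
    return sum(1.0 for t in tokens if t == "a") / len(tokens)


def sample_sequence(logits, rng):
    """Sample SEQ_LEN token ids from the (context-free) toy policy."""
    probs = softmax(logits)
    ids = range(len(VOCAB))
    return [rng.choices(ids, weights=probs)[0] for _ in range(SEQ_LEN)]


def reinforce_step(logits, rng, lr=0.2, baseline=0.0):
    """One policy-gradient update driven only by the sequence-level reward."""
    token_ids = sample_sequence(logits, rng)
    reward = toy_reward([VOCAB[i] for i in token_ids])
    advantage = reward - baseline

    # Gradient of log p(sequence) w.r.t. the logits of a softmax policy:
    # sum over sampled tokens of (one_hot(token) - probs).
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for i in token_ids:
        for k in range(len(logits)):
            grad[k] += (1.0 if k == i else 0.0) - probs[k]

    # Gradient ascent on expected reward; no language modeling loss involved.
    new_logits = [l + lr * advantage * g for l, g in zip(logits, grad)]
    return new_logits, reward


def train(steps=500, seed=0):
    """Run repeated REINFORCE steps and return the final logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)
    baseline = 0.0
    for _ in range(steps):
        logits, reward = reinforce_step(logits, rng, baseline=baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # running average of reward
    return logits


if __name__ == "__main__":
    final_probs = softmax(train())
    print(dict(zip(VOCAB, final_probs)))
```

The point of the sketch is that the policy never sees a target token to imitate. It only sees the scalar score for each sampled sequence and shifts probability toward sequences that scored above the running average.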