by bradfox2 1062 days ago
It's still a loss being backpropagated, but the loss is computed against a different criterion.

Ok that makes a lot of sense.

Why do they call it reinforcement learning then? Is it not traditional RL such as Q-learning?

The distinction making it RL is that the model is training on data produced by the model itself.

The benefit of RL in general is that you're training on states the agent is likely to find itself in; the cost is that you need an agent that explores salient states. That's why we keep seeing RL used as a finishing step after imitation learning (e.g. AlphaStar first learning StarCraft from replays).

The LLM's output is scored by another model, which produces a reward for the entire sequence the LLM emits. The reward model is usually trained on human preferences or some other metric. It's RL because we train on the reward rather than on a language modeling objective.

The LLM is trained to increase this reward score (or, equivalently, to minimize its negative), which is what makes it RL.
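
To make the "still a loss, different criterion" point concrete, here's a rough sketch of a REINFORCE-style update on a toy problem. It is not how any particular RLHF system is implemented (those typically use PPO or similar, which isn't shown here). Everything in it is an assumption for illustration: the "LLM" is just three logits over a three-token vocabulary, and `reward_model` is a hypothetical hard-coded stand-in for a learned reward model. The thing to notice is that the training data is sampled from the model itself, and the loss is `-reward * log p(sample)` instead of cross-entropy against a fixed target.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(token):
    """Hypothetical stand-in for a learned reward model: it 'prefers' token 2."""
    return 1.0 if token == 2 else 0.0


def reinforce_step(logits, rng, lr=0.5, batch=32):
    """One gradient step on the loss -reward * log p(token), averaged over a batch."""
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(batch):
        # The training sample comes from the model's own distribution.
        token = rng.choices(range(len(logits)), weights=probs)[0]
        r = reward_model(token)
        # d/d logit_i of -r * log softmax(logits)[token] = -r * (onehot_i - probs_i)
        for i in range(len(logits)):
            onehot = 1.0 if i == token else 0.0
            grads[i] += -r * (onehot - probs[i]) / batch
    # Plain gradient descent on the loss, i.e. gradient ascent on expected reward.
    return [l - lr * g for l, g in zip(logits, grads)]


def train(steps=200, seed=0):
    """Start from a uniform policy and apply repeated REINFORCE updates."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Swap the sampled `token` for a fixed label from a dataset and drop the reward weighting, and you're back to ordinary supervised training, which is roughly the distinction the comments above are drawing.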