HN Mirror



	by bradfox2 1062 days ago
	It's still a loss being backpropagated, but the loss is calculated against a different criterion.

1 comments

bilsbie 1062 days ago

Ok that makes a lot of sense.

Why do they call it reinforcement learning then? Is it not traditional RL such as Q-learning?


dgant 1062 days ago

The distinction making it RL is that the model is training on data produced by the model itself.

The benefit of RL in general is that you're training on states the agent is likely to find itself in, and the cost is needing an agent which explores salient states. That's why we keep seeing RL as a finishing step after imitation (e.g. AlphaStar first learning StarCraft from replays).
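
A rough toy sketch of the "states the agent is likely to find itself in" point, not anything from AlphaStar. Everything here is hypothetical: a 1-D corridor where a policy returns a step of +1 or -1, an `expert_policy` that supplies the imitation data, and an imperfect `learner_policy` that makes one early mistake. The learner's own rollout ends up in states the demonstrations never covered, which is the data on-policy RL would train on.

```python
def expert_policy(state):
    # Hypothetical expert: always steps right.
    return 1


def learner_policy(state):
    # Hypothetical imperfect learner: steps left at the start, and keeps
    # stepping left in the states that mistake leads to.
    return 1 if state > 0 else -1


def rollout(policy, start=0, steps=5):
    # Record the states a policy visits when it acts for itself.
    state, visited = start, []
    for _ in range(steps):
        visited.append(state)
        state += policy(state)
    return visited


demo_states = rollout(expert_policy)        # imitation data: the expert's states
on_policy_states = rollout(learner_policy)  # RL data: the learner's own states
unseen = sorted(set(on_policy_states) - set(demo_states))

print("demo states:     ", demo_states)
print("on-policy states:", on_policy_states)
print("never in demos:  ", unseen)
```

Imitation alone only ever supervises the expert's states; training on the learner's own rollouts is what puts the off-demo states into the training set.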


bradfox2 1060 days ago

LLM output is scored by another model that produces a reward for the entire sequence emitted by the LLM. The reward model is usually trained on human preferences or some other metric. It's RL because we train on the reward and not some language modeling objective.

The LLM is trained to increase this reward score (or, equivalently, to minimize its negative), which is what makes it RL.
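
To make "train on the reward" concrete, here is a minimal sketch using a plain REINFORCE-style update, which is an assumption on my part (production RLHF setups typically use something more elaborate, such as PPO). The "LLM" is just one softmax over a three-word vocabulary, and `reward_model` is a hand-written stand-in for a learned reward model. `VOCAB`, `train`, and all the hyperparameters are made up for illustration.

```python
import math
import random

VOCAB = ["good", "bad", "ok"]  # hypothetical toy vocabulary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(tokens):
    # Stand-in for a learned reward model: one scalar for the whole sequence.
    return sum(1.0 for t in tokens if t == "good") / len(tokens)


def train(steps=2000, seq_len=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(VOCAB)  # the toy policy's only parameters
    baseline = 0.0               # running mean reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # The policy generates its own training data by sampling a sequence.
        idxs = rng.choices(range(len(VOCAB)), weights=probs, k=seq_len)
        reward = reward_model([VOCAB[i] for i in idxs])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Loss = -advantage * sum_t log p(token_t).
        # For a softmax, d log p(a) / d logit_k = 1[k == a] - p_k.
        grad = [0.0] * len(VOCAB)
        for a in idxs:
            for k in range(len(VOCAB)):
                grad[k] += -advantage * ((1.0 if k == a else 0.0) - probs[k])
        # Ordinary gradient descent on that loss.
        logits = [w - lr * g / seq_len for w, g in zip(logits, grad)]
    return softmax(logits)


final_probs = train()
print(dict(zip(VOCAB, final_probs)))
```

The update is still gradient descent on a loss. The difference from language modeling is that the loss weights the log-probability of the model's own samples by a sequence-level reward, instead of scoring the model against target tokens.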
