by antonvs 33 days ago
Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.

“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
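To make that concrete, here is a minimal sketch of a policy as a function from a context to a probability distribution over next tokens. The vocabulary, the scoring rule, and the names `VOCAB`, `softmax`, and `toy_policy` are all made up for illustration; a real model would compute the scores with a neural network.

```python
import math

# Hypothetical toy vocabulary, for illustration only.
VOCAB = ["the", "cat", "sat", "mat"]

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_policy(context):
    """Map a context (list of tokens) to next-token probabilities.

    The scoring rule is an invented stand-in for a trained model:
    it simply gives a lower score to repeating the last token.
    """
    logits = [0.0 if (context and tok == context[-1]) else 1.0 for tok in VOCAB]
    return dict(zip(VOCAB, softmax(logits)))

dist = toy_policy(["the", "cat"])
for token, prob in dist.items():
    print(f"{token}: {prob:.3f}")
```

The point is only the shape of the thing: context in, one probability per possible next token out.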

The paper discusses “on-policy” and “off-policy” training, which is central to its idea.

Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.

On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against the training objective, and updates the model based on those scores. This reinforces certain aspects of the model's own pretrained behavior, so it is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
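A rough sketch of the difference, using a toy policy over three actions instead of a language model. This is not the paper's method: the on-policy update here is a plain REINFORCE-style step with a mean-reward baseline, chosen only because it is the simplest thing that fits the description, and the names `sft_step`, `on_policy_step`, and `reward_fn` are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def sft_step(logits, example_action, lr=0.5):
    """Off-policy (SFT-like): push up the probability of an externally supplied example."""
    g = grad_log_prob(logits, example_action)
    return [l + lr * gi for l, gi in zip(logits, g)]

def on_policy_step(logits, reward_fn, rng, lr=0.5, n_samples=64):
    """On-policy: sample from the current policy, score the samples, update on the scores."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    rewards = [reward_fn(a) for a in samples]
    baseline = sum(rewards) / len(rewards)  # mean reward as a simple baseline
    grad = [0.0] * len(logits)
    for a, r in zip(samples, rewards):
        g = grad_log_prob(logits, a)
        for i in range(len(grad)):
            grad[i] += (r - baseline) * g[i] / n_samples
    return [l + lr * gi for l, gi in zip(logits, grad)]

# Assumed toy objective: action 0 is "good", everything else is not.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(20):
    logits = on_policy_step(logits, lambda a: 1.0 if a == 0 else 0.0, rng)
print(softmax(logits))
```

Note that `sft_step` never looks at what the policy would have produced, while `on_policy_step` only ever updates on actions the policy itself sampled.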

2 comments

> “Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens.

This should say "...refers to a function that produces a probability distribution." The latter half of the quoted sentence describes it correctly.

Thanks, very good explanation. One question: could one mix both kinds of policies? Are there hybrid policies (with samples from both the inner and outer distributions)? If so, what are they called?

Policies are not of two types. There is just _a_ policy. On- and off-policy are properties of the training process. If you learn a policy using data that was generated by another policy, it is off-policy. If the data was generated by the same policy, it is on-policy. The distinction matters because (very loosely) the nudges that the other policy's data tell you to make are based on the other policy's existing shape, which might differ from your current policy's shape.

Typically, an algorithm itself is called off-policy if it does not care about the source of the data. Example: Q-learning. An algorithm is called on-policy if it requires the source of the data to be the policy itself. In practice, you often use a mixture of both, and apply techniques such as importance sampling to mitigate the off-policy data mismatch.

To answer your question, yes, you can use any mixture of data for your training process. Whenever you use off-policy data, depending on your objective, you might have to use some technique to "fix" your updates.
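As one example of such a "fix", here is a minimal sketch of the importance-sampling idea mentioned above: data is drawn from one policy (the behavior policy), and each sample is reweighted by the ratio of its probability under the policy you actually care about (the target policy). The two policies, the rewards, and the function name `importance_weighted_mean` are invented for illustration.

```python
import random

def importance_weighted_mean(samples, rewards, target_probs, behavior_probs):
    """Estimate the target policy's expected reward from behavior-policy samples.

    Each sample is weighted by target_prob / behavior_prob for the sampled action.
    """
    weights = [target_probs[a] / behavior_probs[a] for a in samples]
    return sum(w * r for w, r in zip(weights, rewards)) / len(samples)

rng = random.Random(0)
behavior = [0.5, 0.5]   # policy that generated the data (assumed)
target = [0.9, 0.1]     # policy we want to evaluate or update (assumed)
reward = [1.0, 0.0]     # action 0 is rewarded, action 1 is not (assumed)

samples = rng.choices([0, 1], weights=behavior, k=20000)
rewards = [reward[a] for a in samples]

naive = sum(rewards) / len(rewards)
corrected = importance_weighted_mean(samples, rewards, target, behavior)
print(f"naive estimate:     {naive:.3f}")
print(f"corrected estimate: {corrected:.3f}")
```

The naive average reflects the behavior policy (expected reward 0.5 in this setup), while the reweighted average should land near the target policy's expected reward of 0.9. The weights can get very large when the two policies disagree strongly, which is why practical methods usually clip or otherwise constrain them.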