The policy is how you select your actions -- in this case, the next token. It can be random, but it doesn't have to be. "Deterministically choose the best action" is a valid policy (we would call it the greedy policy), as long as you have some other means of injecting stochasticity so the model explores the space. Uniform random is also a valid policy, as is always selecting the same token (it obviously wouldn't be very performant, and would defeat the purpose here, but it might be fine in, for example, a multi-armed bandit scenario). Most of the time, the policy is a parameterized distribution, and we want to learn the model parameters that maximize some measure of success (the reward component).
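To make those cases concrete, here is a rough sketch of the four kinds of policy mentioned above. Everything in it is made up for illustration: the `VOCAB` list, the `toy_logits` scoring rule, and the function names are assumptions, not anything from the paper or a real library.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "mat"]  # hypothetical toy vocabulary

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context):
    """Stand-in for a model (made up): score each token given the context."""
    # Tokens already in the context get a low score; later tokens score higher.
    return [0.0 if tok in context else 1.0 + 0.5 * i for i, tok in enumerate(VOCAB)]

def greedy_policy(context):
    """Deterministic: always pick the highest-scoring token."""
    logits = toy_logits(context)
    return VOCAB[logits.index(max(logits))]

def uniform_policy(context, rng):
    """Ignore the context and pick every token with equal probability."""
    return rng.choice(VOCAB)

def constant_policy(context):
    """Always pick the same token, whatever the context."""
    return VOCAB[0]

def stochastic_policy(context, rng):
    """Sample from the parameterized distribution softmax(logits)."""
    probs = softmax(toy_logits(context))
    return rng.choices(VOCAB, weights=probs, k=1)[0]

rng = random.Random(0)
context = ["the"]
print(greedy_policy(context), constant_policy(context))
print(uniform_policy(context, rng), stochastic_policy(context, rng))
```

In the last case the "parameters" would be whatever produces the logits; here that is a fixed rule, whereas in an LLM it is the network weights.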
Off-policy versus on-policy refers to what data the model is trained on. On-policy training is where the training data is collected by the policy. Off-policy training is where the data was collected by a different sampling process (e.g. we have a standard dataset that we're going to use for supervised training).
It’s quite common these days to treat an LLM as a policy in the sense that it takes as a “state” the previous context, and its task is to choose a continuation, as an “action”. It gets a “reward” from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.
This framing has been in use for several years, since it’s what enables RLHF and RLVR. RLHF itself is fairly old: the original ChatGPT used it, and the technique predates that.
It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.
> But the probability vector is the output of the LLM, no?
In some contexts yes, but that's not actually the policy. As I wrote in my other comment (quoting because I think it's worth highlighting):
> "the policy is a function that, given some context, assigns probabilities to possible next tokens."
In the same sentence, I also incorrectly referred to this as a "probability distribution", but that's not accurate: it's a function that produces a probability distribution. The policy instantiated at a specific context produces a probability distribution.
In fact, you'd be closer to the mark if you called the policy "the model", but the two terms emphasize different aspects - as I said, "policy" views it from an RL perspective. From that perspective, the policy is a function, the model is an implementation of that function.
Besides, "output of the LLM" is ambiguous. It commonly means the actual generated token(s) (or text), not the probability distribution. Depending on context, "output of the LLM" could refer to (1) logits, (2) the probability distribution, (3) a single selected token, (4) the full generated text.
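A small sketch of that distinction, with a made-up stand-in for the model. The names `model_logits`, `policy`, `sample_token`, and `generate` are hypothetical, as is the toy vocabulary; the point is only that the policy is the function, and the four "outputs" are different things you can get from it.

```python
import math
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # hypothetical toy vocabulary

def model_logits(context):
    """Stand-in for an LLM forward pass: one raw score per vocabulary token."""
    # Made-up scores; the end-of-sequence score grows with context length.
    return [1.0, 0.5, 0.2, float(len(context))]

def policy(context):
    """The policy: a function from a context to a probability distribution."""
    logits = model_logits(context)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(context, rng):
    """Draw one token from the distribution the policy produces."""
    return rng.choices(VOCAB, weights=policy(context), k=1)[0]

def generate(context, rng, max_tokens=10):
    """Repeatedly sample tokens until <eos> or the length limit."""
    out = list(context)
    for _ in range(max_tokens):
        tok = sample_token(out, rng)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

rng = random.Random(0)
ctx = ["q:"]
logits = model_logits(ctx)      # (1) logits
probs = policy(ctx)             # (2) the probability distribution
token = sample_token(ctx, rng)  # (3) a single selected token
text = generate(ctx, rng)       # (4) the full generated text
print(logits, probs, token, text)
```

Note that `policy` itself is none of the four: it is the function you call to get (2), and calling it on a different context gives a different distribution.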
"Policy" has no such ambiguity - it has a precise definition. That's why technical subjects rely on jargon in the first place, but it results in the exact issue we're running into here: "Jargon enables quick and precise communication among insiders, but it is usually confusing or unintelligible to outsiders."
Yes, I understand one function of jargon, which can be useful to insiders in that it conveys a precise meaning. But it can be confusing to outsiders, and that is also a useful thing for insiders. In the context of LLMs, what other function can produce p(next token) if not the LLM? And you just about make my point for me with regard to jargon being confusing by misidentifying what the policy actually is (something I never would have noticed :). In any case, it's an interesting paper. Thanks for all your downvotes, everyone.
Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.
“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.
The paper discusses “on-policy” and “off-policy” training, which is central to its idea.
Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.
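As a toy illustration (not the paper's setup), here is a minimal sketch of an SFT-style update. It assumes a context-free softmax policy over three tokens with parameters `theta`, and a fixed `dataset` of target tokens that was produced without consulting the model. All names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(theta, dataset, lr=0.5):
    """One gradient step on the mean negative log-likelihood of a fixed dataset."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for target in dataset:
        for i in range(len(theta)):
            # d(-log p[target]) / d theta[i] = p[i] - 1[i == target]
            grad[i] += (probs[i] - (1.0 if i == target else 0.0)) / len(dataset)
    return [t - lr * g for t, g in zip(theta, grad)]

theta = [2.0, 0.0, 0.0]  # the model currently prefers token 0
dataset = [2, 2, 2, 1]   # fixed examples, produced independently of the model
before = softmax(theta)
for _ in range(50):
    theta = sft_step(theta, dataset)
after = softmax(theta)
print(before, after)
```

The dataset never contains token 0, so the update pulls probability away from what the model previously preferred, regardless of whether that preference was useful. That is a crude picture of how off-policy data can overwrite existing behavior.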
On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against whatever results are being trained for, and updates the model based on those scores. This reinforces certain aspects of the model's own pretrained behavior, so it is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.
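The sample, score, update loop can be sketched with a plain REINFORCE-style update on a toy softmax policy. To be clear, this is a generic textbook technique swapped in for illustration, not the paper's algorithm, and the `reward` function, `theta`, and step sizes are all assumptions.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward(action):
    """Hypothetical verifiable reward: only token 1 'passes the test'."""
    return 1.0 if action == 1 else 0.0

def on_policy_step(theta, rng, batch_size=16, lr=0.5):
    """Sample from the current policy, score the samples, take one gradient step."""
    probs = softmax(theta)
    # Sampling from the current policy itself is what makes this on-policy.
    actions = rng.choices(range(len(theta)), weights=probs, k=batch_size)
    rewards = [reward(a) for a in actions]
    baseline = sum(rewards) / batch_size  # mean-reward baseline to reduce variance
    grad = [0.0] * len(theta)
    for a, r in zip(actions, rewards):
        for i in range(len(theta)):
            # d log p[a] / d theta[i] = 1[i == a] - p[i]
            grad[i] += (r - baseline) * ((1.0 if i == a else 0.0) - probs[i]) / batch_size
    return [t + lr * g for t, g in zip(theta, grad)]

rng = random.Random(0)
theta = [1.0, 0.0, 0.0]  # the policy starts out preferring token 0
before = softmax(theta)
for _ in range(200):
    theta = on_policy_step(theta, rng)
after = softmax(theta)
print(before, after)
```

Only actions the policy already produces can be reinforced here, which is one way to read the "gentler" point: the update reweights existing behavior rather than imposing examples from elsewhere.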
Thanks, very good explanation. One question: one could mix both kinds of policies, so are there hybrid policies (with samples from both the inner and outer distributions)? If so, what are they called?
Policies are not of two types. There is just _a_ policy. On- and off-policy are properties of the training process. If you learn a policy using data which was generated using another policy, it is off-policy. If the data was generated using the same policy, it is on-policy. The distinction matters because (very loosely) the nudges that the other policy's data tells you to make are based on the other policy's existing shape, which might be different from your current policy's shape. Typically, an algorithm itself is called off-policy if it does not care about the source of the data. Example: Q-learning. An algorithm is called on-policy if it requires the source of the data to be the policy itself. In practice, you often end up using a mixture of both, and apply techniques such as importance sampling to mitigate the off-policy data mismatch.
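For the importance sampling part, here is a minimal sketch of the basic idea: reweight each sample by the ratio of the target policy's probability to the behavior policy's probability. The two distributions, the `reward` function, and the function name are made up for illustration, and real training code usually adds clipping or other variance control on top of this.

```python
import random

def reward(action):
    """Hypothetical reward: higher token index, higher reward."""
    return float(action)

def importance_weighted_reward(samples, target_probs, behavior_probs):
    """Estimate the target policy's expected reward from behavior-policy samples."""
    total = 0.0
    for a in samples:
        weight = target_probs[a] / behavior_probs[a]  # importance ratio
        total += weight * reward(a)
    return total / len(samples)

behavior = [0.5, 0.3, 0.2]  # policy that generated the data
target = [0.1, 0.2, 0.7]    # policy we actually want to evaluate or update

rng = random.Random(0)
samples = rng.choices(range(3), weights=behavior, k=20000)

naive = sum(reward(a) for a in samples) / len(samples)         # ignores the mismatch
corrected = importance_weighted_reward(samples, target, behavior)
exact = sum(p * reward(a) for a, p in enumerate(target))       # ground truth for this toy
print(naive, corrected, exact)
```

The naive average describes the behavior policy, not the target one. The reweighted estimate recovers the target policy's expected reward on average, at the cost of higher variance when the two policies disagree strongly.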
To answer your question, yes, you can use any mixture of data for your training process. Whenever you use off-policy data, depending on your objective, you might have to use some technique to "fix" your updates.