
Policies are not of two types. There is just _a_ policy. On-policy and off-policy are properties of the training process. If you learn a policy using data that was generated by another policy, it is off-policy. If the data was generated by the same policy, it is on-policy. The distinction matters because (very loosely) the nudges that the other policy's data tell you to make are based on the other policy's existing shape, which might be different from your current policy's shape.

Typically, an algorithm itself is called off-policy if it does not care about the source of the data. Example: Q-learning. An algorithm is called on-policy if it requires the source of the data to be the policy itself. In practice, you often end up using a mixture of both, and apply techniques such as importance sampling to mitigate the off-policy data mismatch.
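To make the Q-learning example concrete, here is a rough sketch of two tabular update rules side by side. SARSA is my own pick as the on-policy counterpart (it is not mentioned above), and the function names, the `Q[state, action]` table layout, and the default `alpha`/`gamma` values are all illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy update (hypothetical helper): ignores who picked the next action."""
    # The target bootstraps from the best action in s_next, so the transition
    # (s, a, r, s_next) can come from any data-collecting policy.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update (hypothetical helper): needs the action actually taken next."""
    # The target bootstraps from a_next, which is assumed to have been chosen
    # by the same policy that is being learned.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

The only difference is the bootstrap term: `np.max(Q[s_next])` does not depend on the behavior that produced the data, while `Q[s_next, a_next]` does.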

To answer your question, yes, you can use any mixture of data for your training process. Whenever you use off-policy data, depending on your objective, you might have to use some technique to "fix" your updates.
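As one example of such a "fix", here is a minimal importance sampling sketch for the simplest case I could think of: a one-step (bandit-style) setting where you want the expected reward under your current policy but only have actions logged by a different one. The function name and the assumption that both policies are given as plain per-action probability arrays are mine, purely for illustration.

```python
import numpy as np

def importance_weighted_mean(rewards, actions, target_probs, behavior_probs):
    """Estimate expected reward under the target policy from behavior-policy data.

    Hypothetical helper: assumes discrete actions indexed 0..n-1 and that
    behavior_probs[a] > 0 for every logged action a.
    """
    rewards = np.asarray(rewards, dtype=float)
    actions = np.asarray(actions)
    # Ratio pi(a) / mu(a) for each logged action: up-weights actions the
    # target policy likes more than the behavior policy did, and vice versa.
    weights = np.asarray(target_probs)[actions] / np.asarray(behavior_probs)[actions]
    return float(np.mean(weights * rewards))

# Data logged by a uniform behavior policy; action 0 always pays 1, action 1 pays 0.
logged_actions = [0, 1, 1, 0]
logged_rewards = [1.0, 0.0, 0.0, 1.0]
estimate = importance_weighted_mean(
    logged_rewards, logged_actions,
    target_probs=[0.9, 0.1], behavior_probs=[0.5, 0.5],
)
print(estimate)
```

A plain average of the logged rewards would describe the behavior policy, not yours; the reweighting is what corrects for the mismatch. With long trajectories the ratios multiply and the variance can blow up, which is why practical methods usually clip or truncate them.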