by psb217
423 days ago
That depends a bit on the length of the RL training and the distribution of problems you're training on. You're correct that RL won't get any "traction" (via positive rewards) on problems where good behavior isn't already in the model's behavior distribution. However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. I.e., the learning you do on problems where the model is already producing positive-reward behavior can nudge the model towards producing positive-reward behavior on problems where it wasn't previously doing so.
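To make the "in principle" part concrete, here's a toy sketch, not a claim about any real training setup. Everything in it is an assumption for illustration: a single shared logit `theta` stands in for the model's shared weights, the fixed per-problem `offsets` stand in for problem difficulty, the reward is 1 for the "good" action and 0 otherwise, and the update is the exact expected policy gradient rather than a sampled one. Transfer is built in by construction here, since both problems read the same parameter; whether real models share structure this way is an empirical question.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train(offsets, train_on, steps=200, lr=0.5):
    """Gradient ascent on expected reward using one shared logit `theta`.

    On problem i the policy picks the "good" action with probability
    p_i = sigmoid(theta + offsets[i]) and gets reward 1 for it, so the
    expected reward is p_i and its gradient w.r.t. theta is p_i * (1 - p_i).
    (Hypothetical setup: real RL would sample actions instead.)
    """
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for name in train_on:
            p = sigmoid(theta + offsets[name])
            grad += p * (1.0 - p)  # near zero when p is near 0: no traction
        theta += lr * grad
    # Report the success probability on every problem, trained on or not.
    return {name: sigmoid(theta + b) for name, b in offsets.items()}


# "easy" starts at a 50% success rate; "hard" starts far outside the
# policy's behavior distribution (success probability well under 0.1%).
OFFSETS = {"easy": 0.0, "hard": -8.0}

hard_only = train(OFFSETS, train_on=["hard"])
both = train(OFFSETS, train_on=["easy", "hard"])

print(f"P(success on hard), trained on hard only:   {hard_only['hard']:.5f}")
print(f"P(success on hard), trained on easy + hard: {both['hard']:.5f}")
```

Trained on the hard problem alone, the gradient is tiny and the success probability barely moves. With the easy problem in the mix, the updates it produces shift the shared parameter, and the hard problem's success probability rises along with it.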