by macleginn
424 days ago
‘Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.’ Wouldn't any kind of RL fail to converge, or even to make progress at all, if the solution weren't already in the base model's distribution? The way training is set up, the model absolutely needs to be able to find correct solutions in a reasonable time; otherwise there wouldn't be any training signal.
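
To make the "no training signal" point concrete, here is a minimal sketch of a plain REINFORCE gradient estimate for a softmax policy over a handful of candidate answers with a binary verifiable reward. The function names, the four-answer setup, and the absence of a baseline are all assumptions made for illustration, not a description of any particular RLVR implementation. If none of the sampled answers is rewarded, every term in the estimate is zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, sampled_actions, rewards):
    """Monte Carlo estimate of d E[reward] / d logits (no baseline).

    For a softmax policy, d log p(a) / d logit_j = 1[j == a] - p_j,
    so each sample contributes reward * (1[j == a] - p_j).
    """
    probs = softmax(logits)
    n = len(sampled_actions)
    grad = [0.0] * len(logits)
    for action, reward in zip(sampled_actions, rewards):
        for j in range(len(logits)):
            indicator = 1.0 if j == action else 0.0
            grad[j] += reward * (indicator - probs[j]) / n
    return grad


# Hypothetical toy problem: four candidate answers, index 3 is the correct one.
logits = [0.0, 0.0, 0.0, 0.0]

# Rollouts that never hit the correct answer: all rewards are 0.
grad_no_hit = reinforce_gradient(logits, [0, 1, 2, 1], [0, 0, 0, 0])

# Rollouts where one sample is correct: the update favors answer 3.
grad_one_hit = reinforce_gradient(logits, [0, 3, 2, 1], [0, 1, 0, 0])
```

In this sketch `grad_no_hit` is all zeros, while `grad_one_hit` raises the logit of the correct answer and lowers the others. Subtracting a group-mean baseline would not change the first case, since all rewards in the group are equal.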
However, if you're training on many problems, it's possible in principle that traction on _any_ of them yields a learning signal that also improves the model's behavior on the others. I.e., what the model learns on problems where it already produces positive-reward behavior can nudge it toward positive-reward behavior on problems where it previously produced none.
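
A toy illustration of that transfer effect, under loudly stated assumptions: the policy's success probability on each problem is a sigmoid of a shared weight vector dotted with hypothetical per-problem features, plus a fixed per-problem offset. The update uses the exact expected-reward gradient rather than sampled rollouts, purely to keep the example deterministic. Training touches only problem A, yet problem B's success probability rises because the two problems share a feature.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def success_prob(w, features, offset):
    """Probability of a rewarded answer under a shared linear-logit policy."""
    z = sum(wi * xi for wi, xi in zip(w, features)) + offset
    return sigmoid(z)


def train_on_problem(w, features, offset, steps=200, lr=0.5):
    """Gradient ascent on expected reward for a single problem.

    With p = sigmoid(w . x + offset), d E[reward] / d w = p * (1 - p) * x.
    """
    w = list(w)
    for _ in range(steps):
        p = success_prob(w, features, offset)
        w = [wi + lr * p * (1.0 - p) * xi for wi, xi in zip(w, features)]
    return w


# Hypothetical setup: A is solvable half the time at the start, while
# B starts out almost never solved, so B alone gives almost no signal.
x_a, offset_a = [1.0, 1.0], 0.0
x_b, offset_b = [1.0, 0.0], -8.0  # shares the first feature with A

w0 = [0.0, 0.0]
p_b_before = success_prob(w0, x_b, offset_b)

# Train only on problem A; problem B is never sampled.
w1 = train_on_problem(w0, x_a, offset_a)
p_b_after = success_prob(w1, x_b, offset_b)
```

Here `p_b_after` ends up larger than `p_b_before` even though B contributed nothing to the updates. Whether anything like this happens at scale in a real model depends on how much structure the problems actually share, which this sketch simply assumes.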