by porridgeraisin
1 day ago
Mode-seeking describes the way in which the distribution changes. RL can pick out slightly lower-probability trajectories and move them toward the top of the distribution. However, exploration is fundamentally limited by the base policy itself. If a trajectory has near-zero probability under the original model, RLVR is unlikely to discover it, because it must first be sampled before it can be rewarded. External search/planning methods such as MCTS or evolutionary search are useful precisely because they can explore candidate trajectories beyond what the policy would ordinarily generate. This is also not just theoretical: GRPO-style methods have been shown to mostly improve `maj@k` and `pass@1` evals, and `pass@k` much less so, especially for high k, meaning they are mostly sharpening the top of the distribution. I'm not saying this makes it useless; it clearly helps for math and coding tasks. But the ceiling exists, and that's what the original tweet was referring to. AlphaEvolve also shows what lies beyond the ceiling, although their planner was rudimentary.
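To make the `pass@1` vs `pass@k` distinction concrete, here is a minimal sketch using the commonly used unbiased `pass@k` estimator (n samples per problem, c of them correct). The per-problem counts in `base` and `sharpened` are made up purely for illustration and are not results from any paper; they just show how sharpening can raise `pass@1` a lot while leaving high-k `pass@k` flat or slightly worse.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct (k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count for each problem."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)

# Hypothetical correct counts out of n = 100 samples on four problems.
n = 100
base = [30, 20, 2, 0]       # assumed base model: problem 3 rarely solved, problem 4 never
sharpened = [90, 80, 1, 0]  # assumed post-RL model: easy problems sharpened, rare one not

for k in (1, 64):
    print(f"pass@{k}: base={mean_pass_at_k(base, n, k):.3f} "
          f"sharpened={mean_pass_at_k(sharpened, n, k):.3f}")
```

Note that the problem with zero correct samples stays at zero for every k in both cases: a trajectory that is never sampled cannot be rewarded, which is the ceiling argument in miniature.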
|
Would you agree that it is a matter of degree rather than a qualitative distinction? There seems to be a broad misconception, in Sutton and others, that output quality cannot exceed that of the base internet distribution; my point is that RL allows you to easily produce an output distribution that is better than whatever data you trained on, according to some evaluation criteria. There are no clear theoretical limits on how much better it can get; rather, many people are asserting guesses that there is an upper bound and that it lives below "human creativity." I just haven't seen any solid theoretical argument, and the empirical evidence has so far shown continual improvement.
Also, I would be keen to look at any sources you have on pass@k not improving much during GRPO.