Hacker News
by highfrequency 1 day ago
> RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

Seems clearly false. Pretraining finds the mean/mode of the data distribution. RL can easily generate many samples around that mode, evaluate them on an external source of truth (eg compile the code and run it) and then selectively train on the good samples. This clearly can go beyond the initial data distribution.

By base distribution, I meant the base model's output distribution.
The model’s distribution will certainly change from the base model’s output distribution during reinforcement learning, shifting toward outputs that score well on an external evaluation. This is very different from mode-seeking. Am I missing something?
Mode-seeking describes the way in which the distribution changes. RL is capable of picking out slightly lower probability trajectories and moving them toward the top of the distribution. However, exploration is fundamentally limited by the base policy itself. If a trajectory has near-zero probability under the original model, RLVR is unlikely to discover it, because it must first be sampled before it can be rewarded. External search/planning methods such as MCTS or evolutionary search are useful precisely because they can explore candidate trajectories beyond what the policy would ordinarily generate. This is also not just theoretical: GRPO-style methods have been shown to mostly improve `maj@k` and `pass@1` evals, and `pass@k` not so much, especially at high k, meaning they are mostly sharpening the top of the distribution.

I'm not saying this makes it useless - it clearly helps for math and coding tasks. But the ceiling exists, and that's what the original tweet was referring to. AlphaEvolve also shows what lies beyond the ceiling, although their planner was rudimentary.
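For anyone following along, here is a rough sketch of the metric being argued about. `pass_at_k` is the standard unbiased estimator (n samples per problem, c of them correct). The two "models" below are made-up per-problem success probabilities, purely an assumption to illustrate what "sharpening" could look like: better `pass@1`, worse `pass@k` at high k. They are not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which were correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected_pass_at_k(success_probs, k: int) -> float:
    """Exact pass@k averaged over problems, given per-sample success probabilities."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustration only).
base_model = [0.30, 0.01]       # solves the hard problem rarely, but not never
sharpened_model = [0.90, 0.00]  # much better on the easy one, lost the hard one

for k in (1, 16, 256):
    base = expected_pass_at_k(base_model, k)
    sharp = expected_pass_at_k(sharpened_model, k)
    print(f"k={k:>3}  base={base:.3f}  sharpened={sharp:.3f}")
```

In this toy setup the sharpened model wins at k=1 and the base model wins at k=256, which is the shape of the crossover being described.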

Sure, but I'd say that moving desirable trajectories from very low probability to high probability is characteristic of genuine human learning and discovery. Technically, quantum gravity, a bestselling novel, or a yet undiscovered proof of the Riemann Hypothesis is "in my distribution", but when we are talking about a long chain of unlikely token completions (with multiplicative probabilities), whether that trajectory lives in the tail of the distribution vs. in the mode makes all the difference.
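To make the "multiplicative probabilities" point concrete, here is a small sketch. The per-token probabilities (0.9 and 0.5) and the 1000-token length are assumptions picked for illustration, and the computation is done in log space because the raw products get extremely small.

```python
import math


def sequence_log_prob(token_probs) -> float:
    """Log-probability of a whole sequence: the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def to_log10(natural_log: float) -> float:
    """Convert a natural-log value to base 10 for readability."""
    return natural_log / math.log(10)


# Hypothetical 1000-token completions with a constant per-token probability.
near_mode = to_log10(sequence_log_prob([0.9] * 1000))
in_tail = to_log10(sequence_log_prob([0.5] * 1000))

print(f"per-token 0.9: sequence probability is about 10^{near_mode:.0f}")
print(f"per-token 0.5: sequence probability is about 10^{in_tail:.0f}")
```

Both sequences are "in the distribution", but their probabilities differ by hundreds of orders of magnitude.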

Would you agree that it is a matter of degree rather than a qualitative distinction? There seems to be a broad misconception among Sutton and others that output quality cannot exceed that of the base internet distribution; my point is that RL allows you to easily produce an output distribution that is better than whatever data you trained on according to some evaluation criteria. There are no clear theoretical limits on how much better it can get; rather, there are many people asserting guesses that there is an upper bound and it lives below "human creativity." I just haven't seen any solid theoretical argument, and the empirical evidence has so far shown continual improvement.

Also, I would be keen to look at any sources you have of pass@k not improving much during GRPO.

I said slightly lower, and I meant it. It's virtually impossible to sample a trajectory that is really, really low probability (say, by smoothing the distribution before sampling) without incurring crazy amounts of noise. And only once you have sampled it can you reward it and do the update.
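A back-of-the-envelope sketch of the sampling constraint, assuming independent rollouts from a fixed policy (a simplification, since the policy changes during training). The function names and example probabilities are mine, for illustration only.

```python
import math


def prob_sampled_at_least_once(p: float, num_rollouts: int) -> float:
    """Chance that a trajectory with probability p appears in num_rollouts independent samples."""
    # 1 - (1 - p)^N, computed with log1p/expm1 so that tiny p does not round away.
    return -math.expm1(num_rollouts * math.log1p(-p))


def rollouts_for_even_odds(p: float) -> int:
    """Smallest number of rollouts giving at least a 50% chance of sampling the trajectory once."""
    return math.ceil(math.log(0.5) / math.log1p(-p))


for p in (1e-2, 1e-6, 1e-12):
    chance = prob_sampled_at_least_once(p, 1_000_000)
    needed = rollouts_for_even_odds(p)
    print(f"p={p:.0e}: chance in 1e6 rollouts={chance:.3g}, rollouts for 50%={needed:.3g}")
```

Until the trajectory shows up at least once there is nothing to reward, so the required rollout count grows roughly like 1/p.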

Again, no one is saying models can't improve beyond the internet, i.e. the data distribution! They clearly can. The claim is that RL without real exploration cannot exceed the base model's distribution, which by virtue of SGD _does_ generalize.

And also, it doesn't mean it's not useful. Improving sample efficiency and making something that happens 1 in 15 times happen 1 in 1.2 times is insanely useful and is what has enabled the kind of coding agents we have today.
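To put numbers on the 1-in-15 to 1-in-1.2 example, a quick sketch. It assumes independent attempts with a fixed success probability, and the retry budget of 3 is an arbitrary choice for illustration.

```python
def expected_attempts(p: float) -> float:
    """Mean number of independent attempts until the first success (geometric distribution)."""
    return 1.0 / p


def success_within(p: float, tries: int) -> float:
    """Probability of at least one success in the given number of independent tries."""
    return 1.0 - (1.0 - p) ** tries


before = 1 / 15   # success rate before RL, from the example above
after = 1 / 1.2   # success rate after RL, from the example above

for label, p in (("before", before), ("after", after)):
    print(
        f"{label}: expected attempts={expected_attempts(p):.1f}, "
        f"success within 3 tries={success_within(p, 3):.3f}"
    )
```

With a small retry budget, the first case fails most of the time and the second almost never does, which is the practical difference for an agent that cannot retry forever.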

Sutton, especially, I doubt has a misconception about this :)

> pass@k

Yeah, AFK now. But it's a well-researched thing. You can look for more, but here's one off the top of my head: https://openreview.net/forum?id=4OsgYD7em5 The original DeepSeek paper also had the result, i.e. the paper that first got famous for using GRPO as a method that works for LLMs. A side result in one of these papers, I forget which one, is that the base model converges in performance with the RL'd one at high k.

Thanks, I appreciate the discussion. The paper you sent is interesting. I agree it looks like for moderate values of K (on the order of 100-1000), RL models actually look a little worse at pass@k than their base models.

So perhaps the right framing of your/Sutton's claim is: RL can upweight low-probability (p) but correct outputs, but there is a limit to how small p can be, and it is on the order of 1 in 100 or 1 in 1000. Implicitly there must be some crossover point where you would call this discovery/creativity if it works for sufficiently small p, right? E.g. if RL can upweight a correct but 1 in a trillion output to 1 in 5, that's got to count as discovery, given that all possible sequences are technically "in the distribution"?

In practice, it does seem like that kind of progress is happening. For example, with the recent Erdos solution [0], I would wager that GPT-4's hit rate on this would have been functionally 0 (certainly less than 1 in a thousand). Curious to hear whether you'd still say this is mode-seeking within a base distribution, and if not, what the right explanation is other than iterative RL.

I'd also highlight that the paper you linked with the pass@k equivalence doesn't technically address the question of how small p can be before RL upweighting breaks down - all of the example problems were easy enough that the base model had a decent hit rate with 128 tries.

[0] https://openai.com/index/model-disproves-discrete-geometry-c...