by highfrequency
13 days ago
Sure, but I'd say that moving desirable trajectories from very low probability to high probability is characteristic of genuine human learning and discovery.
Technically, quantum gravity, a bestselling novel, or an as-yet-undiscovered proof of the Riemann Hypothesis is "in my distribution", but when we are talking about a long chain of unlikely token completions (with multiplicative probabilities), whether that trajectory lives in the tail of the distribution or at the mode makes all the difference. Would you agree that it is a matter of degree rather than a qualitative distinction?
There seems to be a broad misconception, in Sutton and others, that output quality cannot exceed that of the base internet distribution; my point is that RL allows you to easily produce an output distribution that is better, according to some evaluation criterion, than whatever data you trained on. There are no clear theoretical limits on how much better it can get; rather, many people assert guesses that there is an upper bound and that it lies below "human creativity." I just haven't seen any solid theoretical argument, and the empirical evidence so far has shown continual improvement.
Also, I would be keen to look at any sources you have on pass@k not improving much during GRPO.
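To put rough numbers on the "multiplicative probabilities" point, here is a minimal sketch. The per-token probabilities and the chain length are made-up illustrative values, not measurements from any model, and the function names are hypothetical.

```python
import math


def trajectory_log_prob(token_probs):
    """Log-probability of a whole trajectory.

    The probability of a chain of tokens is the product of the per-token
    conditional probabilities, so its log is the sum of their logs.
    """
    return sum(math.log(p) for p in token_probs)


def trajectory_prob(token_probs):
    """Probability of the whole trajectory (can underflow for very long chains)."""
    return math.exp(trajectory_log_prob(token_probs))


if __name__ == "__main__":
    n_tokens = 1000  # assumed chain length, for illustration only
    for per_token in (0.90, 0.99):  # assumed per-token probabilities
        log_p = trajectory_log_prob([per_token] * n_tokens)
        print(f"per-token p={per_token}: log10 P(trajectory) = {log_p / math.log(10):.1f}")
```

Under these assumptions, a small shift in per-token probability moves the whole trajectory by dozens of orders of magnitude, which is the sense in which "tail vs. mode" matters for long chains.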
|
Again, no one is saying models can't improve beyond the internet, i.e. the data distribution! They clearly can. The claim is that RL without real exploration cannot exceed the base model's distribution, which by virtue of SGD _does_ generalize.
And also, it doesn't mean it's not useful. Improving sample efficiency and making something that happens 1 in 15 times happen 1 in 1.2 times is insanely useful and is what has enabled the kind of coding agents we have today.
Sutton, especially, I doubt has a misconception about this :)
> pass@k
Yeah, AFK now. But it's a well-researched thing. You can look for more, but here's one off the top of my head: https://openreview.net/forum?id=4OsgYD7em5 The original DeepSeek paper also had the result, i.e. the paper that first got famous for using GRPO as a method that works for LLMs. A side result in one of these papers (I forget which one) is that the base model converges in performance with the RL'd one at high k.
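For anyone following along, a toy sketch of why pass@1 can jump while pass@k at high k barely moves. It assumes independent samples and a single fixed per-sample success rate for one problem, which is an idealization and not the estimator used in the papers; the two success rates are just the "1 in 15" and "1 in 1.2" figures from above, and `pass_at_k` is a hypothetical helper name.

```python
def pass_at_k(p, k):
    """Chance that at least one of k independent samples succeeds,
    given a per-sample success probability p (idealized model)."""
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    base_p = 1 / 15   # "happens 1 in 15 times" (before RL)
    rl_p = 1 / 1.2    # "happens 1 in 1.2 times" (after RL)
    for k in (1, 4, 16, 64, 256):
        base, rl = pass_at_k(base_p, k), pass_at_k(rl_p, k)
        print(f"k={k:>3}  base={base:.4f}  rl={rl:.4f}  gap={rl - base:.4f}")
```

In this toy model the gap is large at k=1 and shrinks toward zero as k grows, as long as the base model's success rate is nonzero. It says nothing about problems where the base rate is effectively zero, which is where the exploration argument actually bites.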