by porridgeraisin
8 days ago
I said slightly lower, and I meant it. It's virtually impossible to sample a trajectory that has really, really low probability (say, by smoothing the distribution before sampling) without incurring crazy amounts of noise. And only once you've sampled it can you reward it and do the update.

Again, no one is saying models can't improve beyond the internet, i.e. the data distribution! They clearly can. The claim is that RL without real exploration cannot exceed the base model's distribution, which by virtue of SGD _does_ generalize.

That also doesn't mean it's not useful. Improving sample efficiency and making something that happens 1 in 15 times happen 1 in 1.2 times is insanely useful, and it's what has enabled the kind of coding agents we have today. Sutton, especially, I doubt has a misconception about this :)

> pass@k

Yeah, AFK now. But it's a well-researched thing. You can look for more, but here's one off the top of my head: https://openreview.net/forum?id=4OsgYD7em5

The original DeepSeek paper also had the result, i.e. the paper that first got famous for using GRPO as a method that works for LLMs. A side result in one of these papers (I forget which one) is that the base model converges in performance with the RL'd one at high k.
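For anyone who wants to play with the numbers, here is a rough sketch of why a gap at k=1 shrinks at high k. It assumes a single problem with a fixed per-sample success rate and independent samples, which is a simplification; `pass_at_k`, `base_p`, and `rl_p` are names made up for illustration, and the two rates are just the 1-in-15 and 1-in-1.2 figures from above, not measured results.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p (an idealized assumption)."""
    return 1.0 - (1.0 - p) ** k


# Illustrative rates taken from the "1 in 15" vs "1 in 1.2" figures above.
base_p = 1 / 15   # hypothetical base-model success rate
rl_p = 1 / 1.2    # hypothetical post-RL success rate

for k in (1, 8, 64, 256):
    base = pass_at_k(base_p, k)
    rl = pass_at_k(rl_p, k)
    print(f"k={k:>3}  base={base:.4f}  rl={rl:.4f}  gap={rl - base:.4f}")
```

This only shows the arithmetic: whenever the base rate is nonzero, enough samples close the gap. It says nothing about problems where the base rate is effectively zero, and it is not the estimator the papers use.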
So perhaps the right framing of your/Sutton's claim is: RL can upweight low-probability (p) but correct outputs, but there is a limit to how small p can be, and it is on the order of 1 in 100 or 1 in 1,000. Implicitly, there must be some crossover point where you would call this discovery/creativity if it works for sufficiently small p, right? E.g. if RL can upweight a correct but 1-in-a-trillion output to 1 in 5, that's got to count as discovery, given that all possible sequences are technically "in the distribution"?
In practice, it does seem like that kind of progress is happening. For example, with the recent Erdős solution [0], I would wager that GPT-4's hit rate on this would have been functionally 0 (certainly less than 1 in a thousand). Curious to hear whether you'd still say this is mode-seeking within a base distribution and, if not, what the right explanation is if it isn't iterative RL.
I'd also highlight that the paper you linked with the pass@k equivalence doesn't technically address the question of how small p can be before RL upweighting breaks down: all of the example problems were easy enough that the base model had a decent hit rate with 128 tries.
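To put rough numbers on the "how small can p be" question, here is a back-of-the-envelope sketch. It assumes independent samples with a fixed success probability p, and the helper names (`hit_prob`, `samples_for_hit`) are mine, not from any paper. It asks how likely a 128-sample budget is to contain even one correct trajectory to reward, and how many samples you'd need for a coin-flip chance of seeing one.

```python
import math


def hit_prob(p: float, n: int) -> float:
    """Chance of at least one success in n independent samples.
    Uses log1p/expm1 so tiny p (e.g. 1e-12) does not round to zero."""
    return -math.expm1(n * math.log1p(-p))


def samples_for_hit(p: float, confidence: float = 0.5) -> int:
    """Smallest n such that hit_prob(p, n) >= confidence."""
    return math.ceil(math.log1p(-confidence) / math.log1p(-p))


# Success rates mentioned in the thread: 1 in 15, 1 in 100, 1 in 1,000, 1 in a trillion.
for p in (1 / 15, 1e-2, 1e-3, 1e-12):
    print(f"p={p:.1e}  P(hit in 128 tries)={hit_prob(p, 128):.3e}  "
          f"samples for 50% chance={samples_for_hit(p):,}")
```

Under these assumptions, 128 tries almost always hit at 1 in 15 but essentially never at 1 in a trillion, which is why results on problems in the first regime don't say much about the second.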
[0] https://openai.com/index/model-disproves-discrete-geometry-c...