by highfrequency
9 hours ago
Thanks, I appreciate the discussion. The paper you sent is interesting. I agree that for moderate values of k (on the order of 100-1000), RL models actually look a little worse at pass@k than their base models. So perhaps the right framing of your/Sutton's claim is: RL can upweight low-probability (p) but correct outputs, but there is a limit to how small p can be, and it is on the order of 1 in 100 or 1 in 1000.
Implicitly there must be some crossover point where you would call this discovery/creativity if it works for sufficiently small p, right? E.g. if RL can upweight a correct but 1-in-a-trillion output to 1 in 5, that's got to count as discovery, given that all possible sequences are technically "in the distribution"?
In practice, it does seem like that kind of progress is happening. For example, with the recent Erdos solution [0], I would wager that GPT-4's hit rate on this would have been functionally 0 (certainly less than 1 in a thousand). Curious to hear whether you'd still say this is mode-seeking within a base distribution, and if not, what the right explanation is if not iterative RL.
I'd also highlight that the paper you linked with the pass@k equivalence doesn't technically address the question of how small p can be before RL upweighting breaks down: all of the example problems were easy enough that the base model had a decent hit rate with 128 tries.
[0] https://openai.com/index/model-disproves-discrete-geometry-c...
|
> Discovery / creativity
I'm absolutely uninterested in semantic discussions of what is a real discovery, what is creativity, what is intelligence, etc. I simply don't care. If it's useful, great, use it. If it's not, great, don't.
> How small p can be
All of that depends on your sampling procedure. If you intentionally smooth the distribution out, you can sample even the lowest-probability outputs, but you pay for it with noise. Taken to an extreme, this is the monkeys-typing-on-keyboards argument.
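
A minimal sketch of that smoothing trade-off, assuming temperature scaling is the smoothing in question (the comment does not name a specific method). The logits and the choice of index 3 as the "rare but correct" output are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the given logits at a temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy logits: index 0 is the mode, index 3 is a rare but correct output.
LOGITS = [10.0, 2.0, 1.0, 0.0]
RARE = 3

for temperature in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(LOGITS, temperature)
    off_mode = 1.0 - probs[0]  # total mass on everything except the mode
    print(f"T={temperature}: p(rare)={probs[RARE]:.2e}, off-mode mass={off_mode:.2e}")
```

Raising the temperature makes the rare output easier to hit, but the mass on the other off-mode outputs grows along with it, which is the noise being paid for.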
It's a mathematical fact that RL cannot improve things it doesn't sample. In any learned distribution you pay a heavy cost by sampling far away from the mode. Most RL algos sample rollouts, maybe with some smoothing, but that's it. This is why external planners are necessary in order to sample something effectively unsampleable in the base distribution. Simple example: tool use!
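
To make the "cannot improve what it doesn't sample" point concrete, here is a toy sketch assuming plain REINFORCE over a three-action softmax policy with a 0/1 reward. The probabilities, the rollout batches, and the function names are hypothetical, not taken from any real training setup.

```python
def reinforce_gradient(probs, sampled_actions, rewards):
    """Monte Carlo REINFORCE estimate of d E[reward] / d logits for a softmax policy."""
    grad = [0.0] * len(probs)
    for action, reward in zip(sampled_actions, rewards):
        for i, p in enumerate(probs):
            # d log pi(action) / d logit_i = 1[i == action] - pi(i)
            grad[i] += reward * ((1.0 if i == action else 0.0) - p)
    return [g / len(sampled_actions) for g in grad]

# Hypothetical policy: action 2 is the only correct one and has probability 1e-6.
PROBS = [0.9, 0.099999, 0.000001]
CORRECT = 2

def reward(action):
    """0/1 reward: only the correct action is rewarded."""
    return 1.0 if action == CORRECT else 0.0

missed = [0, 0, 1, 0]  # rollouts that never hit the correct action
hit = [0, 0, 1, 2]     # same batch size, but one rollout got lucky

grad_missed = reinforce_gradient(PROBS, missed, [reward(a) for a in missed])
grad_hit = reinforce_gradient(PROBS, hit, [reward(a) for a in hit])
print("no correct sample: ", grad_missed)  # every entry is zero, so no update
print("one correct sample:", grad_hit)     # positive entry for the correct action
```

When every sampled reward is identical, subtracting a group-mean baseline (as group-relative methods do) also leaves nothing to learn from, so the batch with no hit carries no signal either way.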
Sutton and everyone are simply calling for a focus on improving these external planners in the same way, as they also enable much better "continual" learning and so on.
> Erdos solution
The RL was probably what enabled such a huge trajectory to ever become efficiently sampleable in our lifetimes. You can do many useful things like this, and more, purely with the base model distribution.
In fact, doing RL on user chats and so on, especially from pair-coding sessions, is improving these models' coding abilities by a lot, making them even more reliable for SWE. In this regard, mode-seeking is a win.
> All sequences are technically in distribution
If it were truly improving 1-in-a-million things systematically, then you wouldn't see the base model getting the same results given many samples. Granted, those are not Erdos problems.
Could it be that at 1T scale, and for difficult problems specifically, GRPO somehow filters through the noise and picks out the 1 in a trillion? Extremely unlikely (you have the expected number of rollouts required to sample that, and then you have a sparse reward signal and no credit assignment on top of that...). But of course, only 2 companies in the world can do experiments with it, so there could be some unknown effect the rest of the world has not seen. Barring that, no.
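
For a rough sense of the rollout counts involved, here is a back-of-the-envelope sketch. It assumes independent samples with a fixed per-sample success probability p, which is a simplification, and the function names are made up for illustration.

```python
import math

def expected_rollouts(p):
    """Expected number of independent samples until the first success (geometric mean 1/p)."""
    return 1.0 / p

def prob_at_least_one_hit(p, k):
    """Chance that k independent samples contain at least one success: 1 - (1 - p)^k."""
    # log1p/expm1 keep precision when p is tiny and (1 - p) rounds to 1.0.
    return -math.expm1(k * math.log1p(-p))

for p in (1e-3, 1e-6, 1e-12):
    for k in (128, 10**6):
        print(
            f"p={p:.0e}, k={k}: "
            f"P(at least one hit)={prob_at_least_one_hit(p, k):.3e}, "
            f"expected rollouts to first hit={expected_rollouts(p):.1e}"
        )
```

Under these assumptions a 1-in-a-trillion output still has only about a 1-in-a-million chance of appearing anywhere in a million rollouts, and that is before the sparse reward and credit assignment problems come into play.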