

	by t55 382 days ago
	> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

	does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?


yorwba 382 days ago

No, they do not point to any specific examples of novel reasoning strategies that were uncovered, nor is their sampling that extensive (at most 256 samples vs the 2048 used in https://limit-of-rlvr.github.io/ ).
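
To see why the sample budget matters here, a rough sketch: if each sample solves a problem independently with probability `p`, the chance that at least one of `k` samples succeeds is `1 - (1 - p)^k`. The value `p = 0.001` below is made up for illustration and is not taken from either paper; `at_least_one_success` is a hypothetical helper name.

```python
def at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

# Hypothetical per-sample success rate, chosen only for illustration.
p = 0.001
for k in (256, 2048):
    print(f"k={k}: {at_least_one_success(p, k):.3f}")
```

Under this assumption, a problem that looks mostly unsolved at 256 samples can look mostly solved at 2048, which is why the smaller budget is a weak basis for calling a strategy inaccessible.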


grad62304977 378 days ago

It seems unreasonable to say that, in Figure 5 for example, more sampling (of a reasonable amount) would push the base model to 100%.


t55 382 days ago

so you think it's fake news? another example of a paper with strong claims without much evidence?


yorwba 382 days ago

I think it's a case of not coming up with alternative explanations for the observed evidence and hence not designing experiments to distinguish between those explanations.

Their results are consistent with novel reasoning strategies, but they're also consistent with more reliable execution of reasoning strategies that the base model can generate in principle but rarely carries out successfully because of the large number of steps involved. (If you have a model that can do each step independently with a 99% success rate and getting the correct result requires 1000 steps, the chance of making it all the way to the end without a single error is only about 0.004%.)
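
The arithmetic in the parenthetical can be checked in a few lines. This is only a sketch under the same assumption the comment makes (every step succeeds independently with the same probability); the function name `chain_success_probability` is made up for illustration.

```python
def chain_success_probability(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps

p = chain_success_probability(0.99, 1000)
# Prints the probability and the same value as a percentage.
print(f"{p:.2e} ({p * 100:.4f}%)")
```

The result is on the order of 4e-05, which matches the "about 0.004%" figure.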


psb217 382 days ago

One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.

I agree with your general point though. I.e., we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".


yorwba 382 days ago

While it's true that the model assigns non-zero probabilities to all sequences by design, those probabilities can get a lot smaller. E.g. replace that 99% per-step success probability with 10% and suddenly the overall chance of a correct result is truly astronomically small.
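
To put a number on "astronomically small": with a 10% per-step success rate over 1000 independent steps, the probability is 10^-1000, which is too small for a 64-bit float and underflows to zero. A minimal sketch that works in log space instead (the helper name `log10_chain_success` is hypothetical):

```python
import math

def log10_chain_success(per_step: float, steps: int) -> float:
    """Base-10 log of the probability that all independent steps succeed."""
    return steps * math.log10(per_step)

# The direct product underflows to 0.0 in floating point.
print(0.1 ** 1000)

# In log space the two scenarios stay comparable.
print(log10_chain_success(0.99, 1000))  # roughly -4.4
print(log10_chain_success(0.10, 1000))  # roughly -1000
```

Going from 99% to 10% per step moves the result from about one in tens of thousands to something no realistic sampling budget could ever hit.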

For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning trained one, as opposed to just being a little smaller but spread out over many tokens. (Which would better fit a "death by a thousand cuts" scenario.)
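
One way to make that distinction concrete is to compare per-token log-probabilities of the same sequence under both models and ask how much of the total gap comes from a handful of tokens. The sketch below is an assumption about how such a check might look, not something the paper or this thread ran; `gap_concentration` is a hypothetical helper and all log-prob values are invented.

```python
def gap_concentration(base_logprobs, rl_logprobs, top_k=3):
    """Fraction of the total positive log-prob gap (RL minus base)
    that is carried by the top_k tokens with the largest gaps."""
    gaps = [rl - base for base, rl in zip(base_logprobs, rl_logprobs)]
    positive = sorted((g for g in gaps if g > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0.0
    return sum(positive[:top_k]) / total

# Invented data: the base model is slightly worse on every token.
spread = gap_concentration([-0.2] * 100, [-0.1] * 100)

# Invented data: the base model matches the RL model except on 3 tokens.
concentrated = gap_concentration([-0.1] * 97 + [-8.0] * 3, [-0.1] * 100)

print(f"spread: {spread:.2f}, concentrated: {concentrated:.2f}")
```

A value near 1 would point to a few decisive tokens the base model almost never produces, while a value near `top_k / len(sequence)` would fit the "death by a thousand cuts" picture.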
