| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by macleginn 424 days ago
	‘Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.’ — wouldn't any kind of RL fail to converge or even progress at all if the solution weren't to be found in the base model distribution? The way training is set up, the models absolutely need to be able to find right solutions in a reasonable time, otherwis there wouldn't be any training signal.

2 comments

psb217 424 days ago

That depends a bit on the length of the RL training and the distribution of problems you're training on. You're correct that RL won't get any "traction" (via positive rewards) on problems where good behavior isn't already in the model's behavior distribution.

However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. Ie, the learning that you do on problems where the model is already producing positive reward behavior will nudge the model towards producing positive reward behavior on problems where it wasn't previously doing so.

link

macleginn 423 days ago

This is an interesting scenario: do you know of any documented examples?

link

psb217 423 days ago

Offhand, I don't know any specific examples for LLMs. In general though, if you google something like "automated curriculum design for reinforcement learning", you should find some relevant references.

Some straightforward scenarios are in, eg, robotics where one can design sequences of increasingly difficult instances of a task like moving objects from one storage bin to another. The basic idea is that the agent would have no reward or learning signal if it jumped straight into the full version of the task, so you let it develop competence on simpler variants and gradually increase difficulty until the agent can get useful learning signal on the full task.

link

mountainriver 423 days ago

I felt like this was already known right? My understanding was always that the base model had all the paths and RL was learning to navigate them

link