Hacker News
by amluto 504 days ago
The part I found strange: these RL formulations give no reward for incorrect solutions, so unless there are training examples that are easy enough for the base model to solve, the RL process won’t do anything.

So is the actual magic that the base models are good enough to sometimes generate successful CoT output in their unmodified state? Or did I miss something in the R1 paper and the code here?

2 comments

I think this is where the relative rewards come into play: they sample many thinking traces and reward those that are correct. This works at the current 'cutting edge' for the model, exactly where it could be improved.
I was wondering the same thing. I feel there is too large a gap between a raw base model and a model that produces fully correct answers and follows a specific format. My guess is their rule-based reward system is more nuanced than just correctness and format.
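
To make the "relative rewards" point concrete, here is a minimal sketch of group-relative scoring, assuming each trace's reward is normalized against the mean and standard deviation of its own sampled group. The function name and the `eps` term are my own choices for illustration, not code from the paper or the repo. It also shows the original concern: if every sample in a group is wrong, every advantage is zero and there is nothing to learn from.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trace relative to the other traces in its group.

    `eps` is an assumed small constant that avoids division by zero
    when every trace in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled trace is wrong: no trace stands out, so no learning signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One trace out of four is correct: it gets a positive advantage,
# and the incorrect traces get negative ones.
one_right = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

So the prompts that matter are the ones the model solves some of the time, which matches the 'cutting edge' description.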
Yeah I find this part not clearly expressed as well. My best guess is that it's not simply binary "correct/incorrect" but rather the reward is made up of multiple parts (e.g. format + correctness) and structured in a way such that "close enough" answers still get some reward. From there I would expect that a base model might at least be able to "autocomplete" the format/style, at which point RL machinery would kick in to tune it to properly obey the format, and once that's mastered eventually correctness.
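
A rough sketch of what such a multi-part reward could look like, purely as a guess along the lines above. The tag layout, the weights, and the exact-match check are all assumptions made for illustration, not the actual reward used in the paper or in this code.

```python
import re

# Assumed output layout: reasoning in <think> tags, then the answer in <answer> tags.
FORMAT_RE = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(completion, expected, format_weight=0.2, correct_weight=1.0):
    """Give partial credit for following the format, more for a correct answer.

    The weights are hypothetical; they only illustrate that a well-formatted
    but wrong completion can still earn a non-zero reward.
    """
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0  # format not followed: no reward at all
    score = format_weight
    if match.group(1).strip() == expected.strip():
        score += correct_weight
    return score


well_formed_wrong = rule_based_reward("<think>2+2</think><answer>5</answer>", "4")
well_formed_right = rule_based_reward("<think>2+2</think><answer>4</answer>", "4")
unformatted = rule_based_reward("The answer is 4", "4")
```

With a reward shaped like this, a base model that merely autocompletes the tags already separates itself from one that does not, which gives RL something to push on before correctness shows up.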

They did mention something about tuning on an un-SFT'd base model being much slower than first 'warming it up' with some existing reasoning traces.