by amluto
504 days ago
The part I found strange: these RL formulations give no reward for incorrect solutions, so unless there are training examples that are easy enough for the base model to solve, the RL process won’t do anything. So is the actual magic that the base models are good enough to sometimes generate successful CoT output in their unmodified state? Or did I miss something in the R1 paper and the code here?
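
To make the concern concrete, here is a minimal sketch, assuming a GRPO-style setup where each sampled completion's advantage is its reward normalized against the other samples for the same prompt. The function name `group_advantages` and the `eps` term are my own illustration, not taken from the R1 paper or the code here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sample is wrong: all rewards are 0, so there is nothing to prefer.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

# One sample happens to be right: it gets a positive advantage, the rest negative.
one_right = group_advantages([0.0, 0.0, 0.0, 1.0])

print(all_wrong)
print(one_right)
```

Under this assumption, a prompt the base model never solves yields all-zero advantages and therefore no reward-driven update, while a single lucky success is enough to produce a nonzero signal.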