by HarHarVeryFunny
498 days ago
From reading the R1 paper, it seems the steps were:

1) V3 --RL--> R0

2) R0 generates reasoning data, which is augmented to become the "cold start" dataset

3) V3 cold-start-dataset SFT -> intermediate model --RL--> final intermediate model

4) The intermediate model generates reasoning data, which is augmented to create 600K reasoning samples, to which 200K non-reasoning samples are added = 800K

5) V3 800K SFT -> R1 --RL--> R1 final

Is that not a correct understanding?

R1 Zero ("R0") can therefore be characterized as the model created in the first step of this bootstrapping/data-generating process. It's not clear to me what data was used for the R0 RL training process, but I agree it seems to basically be leveraging some limited amount of reasoning (CoT) data naturally occurring in the V3 training set.
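For what it's worth, the five steps as I read them can be written down as a small dependency table. This is only a sketch of my reading above, not anything taken from the paper's code: the `Stage` class, the `upstream` helper and the stage labels ("intermediate", "final intermediate", and so on) are names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str                 # label for the resulting model (hypothetical labels)
    init_from: str            # model whose weights the stage starts from
    method: str               # "RL" or "SFT"
    data_from: Optional[str]  # model that generated the training data, if any


# Sample counts as quoted in the comment above.
REASONING_SAMPLES = 600_000
NON_REASONING_SAMPLES = 200_000
SFT_TOTAL = REASONING_SAMPLES + NON_REASONING_SAMPLES

# The steps as understood in the comment, not a confirmed recipe.
PIPELINE = [
    Stage("R0", "V3", "RL", None),                          # step 1
    Stage("intermediate", "V3", "SFT", "R0"),               # steps 2-3: cold-start SFT
    Stage("final intermediate", "intermediate", "RL", None),  # step 3: RL
    Stage("R1", "V3", "SFT", "final intermediate"),         # steps 4-5: 800K SFT
    Stage("R1 final", "R1", "RL", None),                    # step 5: RL
]


def upstream(name, stages):
    """Every model that `name` depends on, through weights or generated data."""
    by_name = {s.name: s for s in stages}
    seen = []
    todo = [name]
    while todo:
        stage = by_name.get(todo.pop())
        if stage is None:  # a root such as V3, which no stage here produces
            continue
        for parent in (stage.init_from, stage.data_from):
            if parent is not None and parent not in seen:
                seen.append(parent)
                todo.append(parent)
    return seen


if __name__ == "__main__":
    print(SFT_TOTAL)
    print(upstream("R1 final", PIPELINE))
```

Written this way, R0 never contributes weights to the final model. It only shows up in R1's ancestry through the data it generated, which is the bootstrapping point made above.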