Hacker News
by HarHarVeryFunny 498 days ago
From reading the R1 paper, it seems the steps were:

1) V3 --RL--> R0

2) R0 generates reasoning data, which is augmented to become a "cold start" dataset

3) V3 --SFT (cold-start dataset)--> intermediate model --RL--> final intermediate model

4) intermediate model generates reasoning data, which is augmented to create 600K reasoning samples, to which 200K non-reasoning samples are added = 800K

5) V3 --SFT (800K)--> R1 --RL--> R1 final
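
To make that reading easier to check, here is a rough sketch that just writes the five steps down as data. It is only my interpretation of the paper, not anything DeepSeek published: the stage labels, `PIPELINE`, `describe`, and `started_from` are all made-up names, and I'm assuming the model that generates data in step 4 is the post-RL one from step 3.

```python
# Each stage is (source, operation, result). Labels are hypothetical and
# follow the five steps listed above; steps 3 and 5 each contribute two stages.
PIPELINE = [
    ("V3", "RL", "R0"),                                      # step 1
    ("R0", "generate+augment", "cold-start dataset"),        # step 2
    ("V3", "SFT on cold-start dataset", "intermediate model"),      # step 3a
    ("intermediate model", "RL", "final intermediate model"),       # step 3b
    # Assumption: step 4 samples from the post-RL intermediate model.
    ("final intermediate model", "generate+augment", "600K reasoning samples"),  # step 4
    ("V3", "SFT on 800K samples", "R1"),                     # step 5a
    ("R1", "RL", "R1 final"),                                # step 5b
]

# Sample counts as quoted in step 4.
REASONING_SAMPLES = 600_000
NON_REASONING_SAMPLES = 200_000


def sft_dataset_size():
    """Total size of the step-5 SFT dataset."""
    return REASONING_SAMPLES + NON_REASONING_SAMPLES


def describe(pipeline):
    """Render each stage in the same arrow notation used in the list."""
    return [f"{src} --{op}--> {dst}" for src, op, dst in pipeline]


def started_from(pipeline, base):
    """Results of every stage that starts directly from `base`."""
    return [dst for src, _, dst in pipeline if src == base]


if __name__ == "__main__":
    for line in describe(PIPELINE):
        print(line)
    print("SFT dataset size:", sft_dataset_size())
    print("Stages starting from V3:", started_from(PIPELINE, "V3"))
```

Laid out like this, V3 is the starting point three separate times (for R0, the intermediate model, and R1), and the earlier models only feed forward through the data they generate.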

Is that not a correct understanding?

R1 Zero ("R0") can therefore be characterized as a model created as the first step of this bootstrapping/data-generating process.

It's not clear to me what data was used for the R0 RL training process, but I agree it seems to basically be leveraging some limited amount of reasoning (CoT) data naturally occurring in the V3 training set.