by antirez
493 days ago
That's not what happened. R1-Zero is a model in its own right, released with its own set of weights. It's also not an intermediate step obtained while making R1. For R1, a first SFT stage was performed before the RL training, while R1-Zero went through ONLY the RL training (on top of the raw V3).

Of course it's hard to argue that R1-Zero and AlphaZero are very similar, since in the case of AlphaZero (I'm referring to the chess model, not Go) only the rules were known to the model and no human game was shown, while here:

1. The base model is V3, which saw a lot of things in pre-training.

2. The RL for the chain of thought targets math problems that are annotated with the right result. This can be seen as somewhat similar to a chess game finishing with a win, a loss, or a draw. But still... it's text with a problem description.

The similarity, however, is that in the RL used for R1-Zero, the chain of thought that improves problem solving is learned from a cold start, without showing the model any CoT to fine-tune on. That said, the model could sample from the V3 latent space itself, which was full of CoT examples from humans, other LLMs, ...
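To make the "annotated with the right result" point concrete, here is a minimal sketch of an outcome-only reward, analogous to scoring a chess game by its final result. Everything in it is an assumption for illustration: the `Answer:` marker, the exact-match comparison, and the function names are hypothetical, not the format or code actually used to train R1-Zero.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format chosen for
    this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Score only the final result; the chain of thought itself is never graded."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0
```

Note that nothing here inspects the reasoning text: the model is free to produce any chain of thought, and only the final answer determines the reward.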
|
1) V3 --RL--> R0
2) R0 generates reasoning data, which is augmented to become "cold start" dataset
3) V3 cold-start-dataset SFT -> intermediate model --RL--> final intermediate model
4) intermediate model generates reasoning data, which is augmented to create 600K reasoning samples, to which are added 200K non-reasoning samples = 800K
5) V3 800k SFT -> R1 --RL--> R1 final
Is that not a correct understanding?
R1 Zero ("R0") can therefore be characterized as a model created as the first step of this bootstrapping/data-generating process.
It's not clear to me what data was used for the R0 RL training process, but I agree it seems to basically be leveraging some limited amount of reasoning (CoT) data naturally occurring in the V3 training set.
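For what it's worth, the five steps listed above can be written down as a tiny lineage trace. This is only a sketch of the reading proposed in this comment, which is posed as a question and not confirmed here; `Model`, `rl`, and `sft` are hypothetical placeholders that record which stage was applied and do no training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    history: tuple = ()  # training stages applied so far, in order


def rl(model: Model, name: str) -> Model:
    """Placeholder RL stage: only records that RL was applied."""
    return Model(name, model.history + ("RL",))


def sft(model: Model, dataset: str, name: str) -> Model:
    """Placeholder SFT stage: only records which dataset was used."""
    return Model(name, model.history + (f"SFT[{dataset}]",))


v3 = Model("V3")

# 1) V3 --RL--> R0
r0 = rl(v3, "R0")

# 2) R0 output, augmented, becomes the cold-start dataset (just a label here)
cold_start = "cold-start"

# 3) V3 + cold-start SFT, then RL
intermediate = rl(sft(v3, cold_start, "intermediate"), "intermediate-final")

# 4) 600K reasoning samples + 200K non-reasoning samples
reasoning_samples, non_reasoning_samples = 600_000, 200_000
sft_size = reasoning_samples + non_reasoning_samples

# 5) V3 + 800K SFT, then RL
r1 = rl(sft(v3, f"{sft_size // 1000}K", "R1"), "R1-final")

print(r0.history)  # R0 has RL only, no SFT stage
print(r1.history)  # R1 has SFT first, then RL
```

Under this reading, R0 appears both as a standalone RL-only model and as the source of the data in step 2, so both descriptions in this thread can hold at once.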