by Chio
492 days ago
We kind of have that in DeepSeek-R1-Zero [1], but it has problems. From the original authors:

> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.

A lot of these we can probably solve, but as others have pointed out, we want a model that humans can converse with, not an AI for the purpose of other AI. That said, it seems like a promising area of research:

> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.

[1] https://github.com/deepseek-ai/DeepSeek-R1
|
AlphaGo came before AlphaGo Zero; it was trained on human games, then improved further via self-play. The later AlphaGo Zero proved that pre-training on human games was not necessary, and the model could learn from scratch (i.e. from zero) just via self-play.
For DeepSeek-R1, or any reasoning model, training data is necessary but hard to come by. One of the main contributions of the DeepSeek-R1 paper was describing their "bootstrapping" (my term) process, whereby they started with a non-reasoning model, DeepSeek-V3, and used a three-step process to generate more and more reasoning data from it (plus a few other sources) until they had enough to train DeepSeek-R1, which they then further improved with RL.
DeepSeek-R1-Zero isn't a self-play version of DeepSeek-R1. It was just the result of the first (0th) step of this bootstrapping process, in which they used RL to fine-tune DeepSeek-V3 into the R1-Zero model (something of an idiot savant, a one-trick pony), which was then capable of generating training data for the next bootstrapping step.