|
|
|
|
|
by rahimnathwani
504 days ago
|
|
From the Deepseek R1 paper: For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.
The impression I got from the paper, although I don't think it was explicitly stated, is that they think distillation will work better than training the smaller models using RL (as OP did). |
|
I found this statement from the paper to be at odds with what you cited, but I guess they mean SFT+RL would be better than either just SFT and RL