Hacker News new | ask | show | jobs
by BarakWidawsky 348 days ago
It’s interesting that it looks like they didn’t apply their own RL to the model, and instead fine tuned on reasoning traces from large datasets and generating reasoning traces from larger models
1 comments

Indeed we opted for offline methods like Anchored Preference Optimization as we found in the Open R1 project that doing multi-task RL on small models is quite a hassle to get right. With offline methods, you focus much more on dataset curation / generation, but that still provides faster iteration cycles for the model scale we’re dealing with!