by randomcatuser 467 days ago
Wait, what's the difference between using GRPO and traditional fine-tuning of Qwen using your provided dataset?

Would be super interesting to see which one is more data-efficient!

1 comment

Great question! The dataset includes prompts and solutions, but no "gold" answer per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks, even the larger models leave a lot of headroom on this task, so I wouldn't expect that to match the GRPO results. At the very least you would probably want to do rejection sampling to discard bad responses. It would definitely be a good experiment!
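
For anyone who wants to try the SFT baseline, here is a rough sketch of the rejection-sampling step. It is not code from the project: `generate` (a call into whatever larger model you sample from) and `verify` (a check of a response against the dataset's solution) are hypothetical callables you would supply, and the `prompt`/`solution`/`completion` field names are assumptions.

```python
from typing import Callable, Dict, List


def rejection_sample(
    examples: List[Dict[str, str]],
    generate: Callable[[str], str],      # hypothetical: samples one response from a larger model
    verify: Callable[[str, str], bool],  # hypothetical: True if the response matches the solution
    samples_per_prompt: int = 4,
) -> List[Dict[str, str]]:
    """Build an SFT set from sampled responses, discarding any that fail verification."""
    kept = []
    for ex in examples:
        for _ in range(samples_per_prompt):
            response = generate(ex["prompt"])
            if verify(response, ex["solution"]):
                kept.append({"prompt": ex["prompt"], "completion": response})
                break  # keep at most one accepted response per prompt
    return kept
```

Prompts where every sample fails get dropped entirely, so the resulting SFT set skews toward problems the larger model can already solve. That is part of why I'd expect it to fall short on the harder examples.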