| Yes, you're correct! Very good question on SFT vs GRPO. Assume the dataset I have is the question "What is 2+2?" with the answer "The answer is 4". 1. If you have very high-quality labelled data that includes the reasoning, SFT should work fine, i.e. "What is 2+2? Let me think about it....., The answer is 4". 2. If you only have the input "What is 2+2?" and just the answer "4", with nothing in between, GRPO could be very helpful! GRPO can help produce the reasoning traces automatically, though you will need to provide some scoring / reward functions. For example, if the answer == 4, add +1 to the score. 3. You can combine SFT and GRPO! Do SFT first, then GRPO; this will most likely make GRPO converge faster. |
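
To make the "if the answer == 4, +1 score" idea concrete, here is a minimal sketch of what such a reward function could look like. It is plain Python and not tied to any particular trainer's API: the names `extract_final_answer` and `correctness_reward` are hypothetical, and treating the last number in the completion as the model's final answer is an assumption made for illustration.

```python
import re
from typing import List, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number found in the completion, or None if there is none.

    Assumption: the final number in the text is the model's answer.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def correctness_reward(completions: List[str], answers: List[str]) -> List[float]:
    """Give +1.0 when the extracted answer matches the reference, else 0.0."""
    rewards = []
    for completion, answer in zip(completions, answers):
        predicted = extract_final_answer(completion)
        rewards.append(1.0 if predicted == answer else 0.0)
    return rewards


if __name__ == "__main__":
    samples = [
        "Let me think about it. 2+2 is 4. The answer is 4",
        "The answer is 5",
        "I don't know",
    ]
    print(correctness_reward(samples, ["4", "4", "4"]))
```

Only the final answer is scored here, so the reasoning in between is left for the model to discover. In practice you would probably add further reward functions (for example, one for output format) alongside this one.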