| HN Mirror

by danielhanchen 490 days ago

Yes you're correct!

Very good question on SFT vs GRPO!

Assume the dataset I have is "What is 2+2?", "The answer is 4".

1. If you have very high quality labelled data, SFT should work fine. I.e. "What is 2+2? Let me think about it....., The answer is 4"

2. If you only have the input "What is 2+2?" and just the answer "4", but nothing in between, GRPO could be very helpful! GRPO can help produce the reasoning traces automatically - you will need to provide some scoring / reward functions though. For example, if the answer == 4, +1 score.

3. You can combine SFT and GRPO! Do SFT first, then GRPO - this will most likely make GRPO converge faster!
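
To make the "if the answer == 4, +1 score" idea from point 2 concrete, here is a minimal sketch of a scoring function in plain Python. The name `correctness_reward` and the "take the last number in the completion" rule are assumptions for illustration, not part of any particular library:

```python
import re

def correctness_reward(completion: str, expected: str) -> float:
    """Score 1.0 if the last number in the completion equals the expected answer, else 0.0.

    Hypothetical helper: treating the last number as the final answer is an
    assumption that suits short arithmetic questions like "What is 2+2?".
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0  # no numeric answer at all
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0

# The reasoning in between is free-form; only the final answer is scored.
print(correctness_reward("2+2... let me think about it. The answer is 4", "4"))
print(correctness_reward("The answer is 5", "4"))
```

Only the final answer is checked here, which is why no labelled reasoning traces are needed for this kind of reward.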

2 comments

sidkshatriya 490 days ago

Does this mean that you can only do GRPO on models that already have reasoning traces in <think>...</think>?

danielhanchen 490 days ago

Oh not at all!! You can actually train a model to generate the <think>...</think> tokens themselves! That's how DeepSeek trained R1 Zero, which essentially gave the model reasoning skills!

sidkshatriya 490 days ago

Won't you have to use a distilled DeepThink model then? Because the training phase with GRPO requires the model to put its reasoning within <think></think> for the least loss.

danielhanchen 490 days ago

Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!

The <think> tokens are optional and only there for formatting. You could ask for <reasoning> or <thinking> or [reasoning] in the system prompt instead, for example.
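
As a rough sketch of how the model can "learn" from scores alone: GRPO samples a group of completions for the same prompt, scores each one, and compares every score to the group average. The function name below is hypothetical, and using the population standard deviation plus a small `eps` is an assumption (implementations differ on these details):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Completions that scored above the group mean get a positive advantage
    (made more likely), those below get a negative one (made less likely).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for "What is 2+2?", two of which got the answer right.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Note that if every completion in the group gets the same score, all advantages are zero and that prompt contributes no learning signal.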

codelion 490 days ago

Models already have hidden, latent CoT-style reasoning within them; GRPO would help induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve the performance of the model.

danielhanchen 490 days ago

Oh yep! The DeepSeek paper also mentioned how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate those latent skills!

wrsh07 490 days ago

Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace.

danielhanchen 490 days ago

Yes exactly! You can in fact add that as a reward function for style and format checking!
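
A format check like that could look roughly like the sketch below. The function name, the 1.0 / 0.0 scores and the exact "reasoning block, then answer" layout are assumptions for illustration; the tag is a parameter since any tag name requested in the prompt would do:

```python
import re

def format_reward(completion: str, tag: str = "think") -> float:
    """Score 1.0 if the completion is '<tag>reasoning</tag>answer', else 0.0.

    Hypothetical helper: both the reasoning and the answer must be non-empty.
    """
    t = re.escape(tag)
    pattern = rf"\s*<{t}>(.+?)</{t}>\s*(.+)"
    match = re.fullmatch(pattern, completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # missing or malformed reasoning block
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    return 1.0 if reasoning and answer else 0.0

print(format_reward("<think>2+2 is 4</think>The answer is 4"))
print(format_reward("The answer is 4"))
print(format_reward("<reasoning>2+2 is 4</reasoning>4", tag="reasoning"))
```

A score like this would typically be added to a separate correctness score, so the model is rewarded both for showing a trace and for getting the answer right.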

lyu07282 490 days ago

Can you give some real-world examples of when this would be useful? Does this work for tasks requiring tool calling as well?

danielhanchen 490 days ago

Yes, tool calling is a prime example!! I.e. you have some specific task and a final output that involves some tools, but sadly the steps to call the tools / the stuff in between / the thinking process are missing.

You can employ GRPO and maybe add an actual Python environment for the model to learn to act in.
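
A toy version of such an environment might look like the sketch below. Everything in it is an assumption for illustration: the JSON call format, the `TOOLS` registry, and the partial credit of 0.1 for a well-formed call with the wrong result. A real setup would sandbox the execution:

```python
import json

# Hypothetical tool registry the model is allowed to call.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def run_tool_call(completion: str):
    """Execute a call like {"tool": "add", "args": [2, 2]}; return None if it is invalid."""
    try:
        call = json.loads(completion)
        return TOOLS[call["tool"]](*call["args"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # not JSON, unknown tool, or wrong arguments

def tool_reward(completion: str, expected) -> float:
    """1.0 for the right result, 0.1 for a valid call with the wrong result, else 0.0."""
    result = run_tool_call(completion)
    if result is None:
        return 0.0
    return 1.0 if result == expected else 0.1

print(tool_reward('{"tool": "add", "args": [2, 2]}', 4))
print(tool_reward('{"tool": "multiply", "args": [2, 3]}', 4))
print(tool_reward("I think the answer is 4", 4))
```

Only the final result is scored, so the steps in between (which tool to pick, which arguments to pass) are left for the model to discover.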

byefruit 490 days ago

I'm waiting for https://github.com/huggingface/trl/pull/2810 to land. I think this should work with the existing unsloth setup without changes.

danielhanchen 490 days ago

Oh yes!! Will has definitely been on a roll!! Excited for the PR as well!
