| HN Mirror



	by krackers 552 days ago
	I've been trying to follow the literature on PPO/GRPO as applied to LLMs. From what I understand, since the reward is only given once the entire CoT sequence is sampled, traditional RL techniques would require some form of credit assignment to distribute that reward among individual tokens – which is where the critic/value network comes in, right? Instead, DeepSeek (with GRPO) seems to omit that value function entirely and use only sparse rewards. How does this end up being more efficient? I thought the sparse nature of the rewards makes it harder to converge to the optimal policy.
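For readers following the question, here is a minimal sketch of the group-relative advantage that GRPO uses in place of a learned value function under outcome supervision: sample several outputs for one prompt, score each with a single scalar reward, normalize the rewards within the group, and give every token of an output that same normalized value. The function names, the `eps` term, and the choice of population standard deviation are assumptions for illustration, not DeepSeek's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one scalar reward per sampled output within its group.

    The group mean plays the role of the baseline that a critic would
    otherwise estimate. `eps` (an assumption) avoids division by zero
    when every output in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(rewards, lengths):
    """Broadcast each output's advantage to all of its tokens."""
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, lengths)]


# Four sampled outputs for one prompt: two judged correct, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 3, 4, 2]  # hypothetical token counts per output
token_advantages = per_token_advantages(rewards, lengths)
```

Note that if every output in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one concrete way the sparsity concern in the question shows up.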


serialx 552 days ago

I don't think it's using only sparse rewards, since there are also format rewards. The training recipe is pretty comprehensive and involves multiple stages.[1] The paper mentions that when only the RL technique is used, the output is often not suitable for reading (language mixing, etc.). That feels like an AlphaZero moment for LLMs?

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...


krackers 552 days ago

The R1 paper says that they didn't use "process reward modeling". And the paper that introduced GRPO says that it can be used either with "outcome supervision" or "process supervision", with outcome supervision "only provid[ing] a reward at the end of each output". Put together, doesn't that imply R1 uses sparse rewards provided only at the end of the CoT sequence?


serialx 552 days ago

Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1. Your "sparse reward" means only providing reward at the end of each output.
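The two meanings of "sparse reward" in this exchange can be told apart with a small sketch. One is about where in a single output the reward arrives (only at the final token); the other is about how often sampled outputs get any nonzero reward at all. The helper names and the toy data below are hypothetical and only illustrate the distinction.

```python
def is_terminal_only(token_rewards):
    """True if every token before the last one receives zero reward."""
    return all(r == 0 for r in token_rewards[:-1])


def nonzero_fraction(outcome_rewards):
    """Share of sampled outputs whose final reward is nonzero."""
    hits = sum(1 for r in outcome_rewards if r != 0)
    return hits / len(outcome_rewards)


# Sparse in time: one output of five tokens, rewarded only at the end.
token_rewards = [0, 0, 0, 0, 1]

# Sparse in value: final rewards for eight sampled outputs.
rarely_rewarded = [0, 0, 0, 0, 0, 0, 0, 1]
often_rewarded = [1, 0, 1, 1, 0, 1, 1, 0]
```

A reward scheme can be terminal-only for every output and still be nonzero for many of them, so the two notions are independent.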


HeatrayEnjoyer 552 days ago

> Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1.

Did we introduce the abusive pressure of Korean educational culture to machines?
