|
|
|
|
|
by krackers
505 days ago
|
|
I've been trying to follow the literature on PPO/GRPO as applied to LLMs. From what I understand, since reward is only given once the entire COT sequence is sampled, traditional RL techniques would require some form of credit-assignment to distribute that reward amongst individual tokens – which is where the critic/value network comes in, right? Instead DeepSeek (with GRPO) seems to just omit that value function entirely and use only sparse rewards. How does this end up being more efficient, since I thought the sparse nature of rewards makes it harder to converge to the optimal policy? |
|
[1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...