Hacker News new | ask | show | jobs
by codelion 490 days ago
Models already have hidden latent CoT style reasoning within them, GRPO would help induce that behavior. For instance see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actual improve performance of the model.
1 comments

Oh yep! The deepseek paper also mentioned how large enough LLMs inherently have responding capabilities and the goal of GRPO is to accentuate latent skills!