HN Mirror



	by codelion 490 days ago
	Models already have latent CoT-style reasoning within them, and GRPO would help induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053, where a sampling technique (CoT decoding) can actually improve the model's performance.
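For anyone unfamiliar with CoT decoding, here is a rough sketch of the idea as I understand it: instead of only following the greedy first token, branch on the top-k first tokens, continue each branch greedily, and keep the branch where the model is most confident. Everything below is a toy assumption for illustration (the `TOY` table, `toy_next_probs`, and `cot_decode` are made-up names, not code from the linked thread), and confidence is simplified to the mean top-1 vs. top-2 probability margin over the whole continuation rather than over the answer tokens only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a language model: maps the tokens generated
# so far to a next-token probability distribution.
NextProbs = Callable[[Sequence[str]], Dict[str, float]]

# Toy distributions (invented numbers). The greedy first token "5" leads to
# a low-confidence answer; the alternative "I" leads to a confident one.
TOY: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"5": 0.6, "I": 0.4},
    ("5",): {"<eos>": 0.55, "apples": 0.45},
    ("I",): {"have": 0.95, "am": 0.05},
    ("I", "have"): {"8": 0.9, "5": 0.1},
    ("I", "have", "8"): {"<eos>": 0.99, ".": 0.01},
}


def toy_next_probs(prefix: Sequence[str]) -> Dict[str, float]:
    return TOY[tuple(prefix)]


def cot_decode(
    next_probs: NextProbs, k: int = 2, max_len: int = 8, eos: str = "<eos>"
) -> List[Tuple[List[str], float]]:
    """Branch on the top-k first tokens, decode each branch greedily,
    and rank branches by mean (top-1 minus top-2) probability margin."""
    first = sorted(next_probs([]).items(), key=lambda kv: kv[1], reverse=True)[:k]
    results: List[Tuple[List[str], float]] = []
    for token, _ in first:
        seq = [token]
        margins: List[float] = []
        while seq[-1] != eos and len(seq) < max_len:
            ranked = sorted(next_probs(seq).items(), key=lambda kv: kv[1], reverse=True)
            top_p = ranked[0][1]
            second_p = ranked[1][1] if len(ranked) > 1 else 0.0
            margins.append(top_p - second_p)  # confidence at this step
            seq.append(ranked[0][0])          # greedy continuation
        score = sum(margins) / len(margins) if margins else 0.0
        results.append((seq, score))
    # Most confident path first.
    return sorted(results, key=lambda r: r[1], reverse=True)


best_tokens, best_score = cot_decode(toy_next_probs, k=2)[0]
print(best_tokens, round(best_score, 3))
```

With `k=1` this reduces to plain greedy decoding, which is the point of the comparison: the more confident path only shows up once you look past the top first token.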

1 comment

danielhanchen 490 days ago

Oh yep! The DeepSeek paper also mentioned how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate latent skills!
