| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by danielhanchen 490 days ago
	Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!! The <think> tokens are optional for formatting reasons. You could use <reasoning> or <thinking> or [reasoning] for example in the system prompt.