| HN Mirror



	by krackers 552 days ago
	Yeah, I find this part not clearly expressed either. My best guess is that the reward isn't simply binary "correct/incorrect" but is made up of multiple parts (e.g. format + correctness), structured so that "close enough" answers still get some reward. From there I would expect a base model to at least be able to "autocomplete" the format/style, at which point the RL machinery would kick in to tune it to properly obey the format and, once that's mastered, eventually correctness. They did mention something about tuning an un-SFT'd base model being much slower than first 'warming it up' with some existing reasoning traces.
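	To make that guess concrete, here's a rough sketch of what a multi-part reward could look like. To be clear, this is my own illustration and not anything confirmed about the actual setup: the `<think>`/`<answer>` tag names, the 0.2/0.8 weights, the 5% "close enough" tolerance, and all function names are assumptions.

```python
import re


def format_reward(completion: str) -> float:
    """1.0 if the text is exactly <think>...</think><answer>...</answer>, else 0.0.

    The tag names are an assumed convention, chosen only for illustration.
    """
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion, flags=re.DOTALL) else 0.0


def correctness_reward(completion: str, target: float) -> float:
    """1.0 for an exact numeric match, 0.5 for a near miss, 0.0 otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    try:
        predicted = float(match.group(1).strip())
    except ValueError:
        return 0.0  # answer tag present but not a number
    if predicted == target:
        return 1.0
    # Hypothetical partial credit: within 5% relative error counts as "close enough".
    rel_err = abs(predicted - target) / max(abs(target), 1e-9)
    return 0.5 if rel_err <= 0.05 else 0.0


def total_reward(completion: str, target: float,
                 w_format: float = 0.2, w_correct: float = 0.8) -> float:
    """Weighted sum of the two parts; the weights are made up for this sketch."""
    return (w_format * format_reward(completion)
            + w_correct * correctness_reward(completion, target))
```

	With something shaped like this, a completion that only gets the tags right still earns the format share, so there's a nonzero signal to climb even before the model produces any correct answers.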