| HN Mirror



	by markisus 298 days ago
	Just speculating, but proximity to a reference answer is a much denser reward signal: every response gets a graded score. In contrast, parsing out a final answer into a pass/fail only provides a sparse reward signal.
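	To make the dense vs. sparse distinction concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names are made up, "final answer" is taken to be the last non-empty line of the response, and `difflib.SequenceMatcher` is only a stand-in for whatever proximity measure a real setup would use.

```python
from difflib import SequenceMatcher


def final_answer(response: str) -> str:
    """Assumption: the final answer is the last non-empty line of the response."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def sparse_reward(response: str, reference: str) -> float:
    """Pass/fail: 1.0 only when the parsed final answer matches the reference exactly."""
    return 1.0 if final_answer(response) == reference.strip() else 0.0


def dense_reward(response: str, reference: str) -> float:
    """Graded score in [0, 1] based on string similarity to the reference.

    SequenceMatcher is a stand-in proximity measure, not what any real system uses.
    """
    return SequenceMatcher(None, final_answer(response), reference.strip()).ratio()


if __name__ == "__main__":
    reference = "42"
    for response in ["So the answer is\n42", "So the answer is\n41", "No idea\n7"]:
        print(repr(final_answer(response)),
              sparse_reward(response, reference),
              dense_reward(response, reference))
```

	A near miss like "41" gets zero from the pass/fail reward but a nonzero score from the proximity reward, which is the sense in which the latter carries more signal per sample.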

1 comment

krackers 295 days ago

Yup, RLVR as implemented by DeepSeek et al. uses only outcome supervision rather than process supervision: the reward depends on the final answer, not on the intermediate steps. There have been attempts at process supervision, though.
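A toy Python sketch of the difference, under loud assumptions: the function names and the `toy_step_scorer` checker are hypothetical, the "reasoning steps" are plain strings of the form `a + b = c`, and nothing here reflects how DeepSeek or anyone else actually implements it. Outcome supervision puts a single pass/fail on the end of the trajectory; process supervision scores each step.

```python
from typing import Callable, List


def outcome_rewards(steps: List[str], answer: str, reference: str) -> List[float]:
    """Outcome supervision: only the last step carries a reward (pass/fail on the answer)."""
    if not steps:
        return []
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if answer.strip() == reference.strip() else 0.0
    return rewards


def process_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Process supervision: every intermediate step is scored on its own."""
    return [step_scorer(step) for step in steps]


def toy_step_scorer(step: str) -> float:
    """Hypothetical checker: 1.0 if a step of the form 'a + b = c' is arithmetically correct."""
    try:
        lhs, rhs = step.split("=")
        a, b = lhs.split("+")
        return 1.0 if int(a) + int(b) == int(rhs) else 0.0
    except ValueError:
        # Step is not in the expected 'a + b = c' shape.
        return 0.0


if __name__ == "__main__":
    steps = ["2 + 3 = 5", "5 + 4 = 10"]  # second step is wrong
    print(outcome_rewards(steps, answer="10", reference="9"))
    print(process_rewards(steps, toy_step_scorer))
```

With outcome supervision the wrong trajectory just gets a zero at the end; with process supervision the first step still gets credit and the error is localized to the second step.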
