| HN Mirror


	by swordsmith 239 days ago
	Seems like he thinks RLVR == learning from a single binary reward for the whole chain, completely discounting techniques that provide denser rewards, like process reward supervision?
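	To make the distinction concrete, here is a rough toy sketch of the two reward shapes. Everything in it is made up for illustration: `outcome_rewards`, `process_rewards`, and `toy_step_scorer` are hypothetical names, and the arithmetic checker is only a stand-in for a learned process reward model, not how any real RLVR pipeline scores steps.

```python
from typing import Callable, List


def outcome_rewards(steps: List[str], final_correct: bool) -> List[float]:
    """Sparse signal: one binary reward on the last step, zero everywhere else."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def process_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Dense signal: every intermediate step gets its own score."""
    return [float(step_scorer(step)) for step in steps]


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a process reward model.

    Assumes each step has the form "a + b = c" and checks the arithmetic.
    """
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


# A three-step chain summing 2 + 3 + 4 + 1; the second step is wrong,
# so the final answer (11 instead of 10) is wrong too.
chain = ["2 + 3 = 5", "5 + 4 = 10", "10 + 1 = 11"]

sparse = outcome_rewards(chain, final_correct=False)
dense = process_rewards(chain, toy_step_scorer)

print(sparse)  # [0.0, 0.0, 0.0]
print(dense)   # [1.0, 0.0, 1.0]
```

	With the outcome-only reward the whole chain just gets a zero, while the per-step scores point at the second step as the place where it went wrong.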