| HN Mirror

	by mrtesthah 47 days ago
	> "is the RLHF judge happy with the answer."

	Using Reinforcement Learning with Verifiable Rewards (RLVR) to improve math and coding success rates seems like an exception.