HN Mirror



	by mountainriver 423 days ago
	There has always been a post-training phase with RLHF, though, ever since GPT-3.5. It's nothing new, and it has worked great for a long time. The difference now is RLVR, which, yes, I do suspect is causing models to over-optimize for verifiable tasks and probably lose a lot of nuanced information in the process.