Hacker News
by mountainriver 423 days ago
There has been a post-training phase with RLHF ever since GPT-3.5, though.

It’s nothing new, and it has worked great for a long time. The difference now is RLVR, which, yes, I do suspect is causing models to over-optimize for verifiable tasks and probably lose a lot of nuanced information.