Hacker News
by segmondy 128 days ago
RLVR: Reinforcement Learning with Verifiable Rewards. Before this it was RLHF, reinforcement learning from human feedback. Models can now be trained without a human in the loop. For coding, you give the model a problem to solve and you have a means of verifying the answer; think of a unit test. The model writes the code: if it fails, it gets a fail; if it passes, it gets a pass. Do enough of this and the model really learns to code on its own, or to operate better as an agent. That's the main thing that has changed between last year and this year.
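To make the "unit test as reward" idea concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not how any lab actually implements RLVR: the task (write a function called `add`), the `TEST_CASES` list, and the `verifiable_reward` helper are all hypothetical names, and a real pipeline would run candidates in a sandbox and feed the reward into a policy-gradient update, which is left out here.

```python
from typing import Dict, List, Tuple

# Hypothetical task: the model must write a function named `add`.
# Each test case is (arguments, expected result).
TEST_CASES: List[Tuple[Tuple[int, int], int]] = [
    ((1, 2), 3),
    ((0, 0), 0),
    ((-1, 1), 0),
]


def verifiable_reward(candidate_source: str, func_name: str = "add") -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace: Dict[str, object] = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real setup would sandbox this.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in TEST_CASES:
            if func(*args) != expected:
                return 0.0  # wrong answer: the model "gets a fail"
    except Exception:
        return 0.0  # syntax error, missing function, or crash also counts as a fail
    return 1.0  # every test passed: the model "gets a pass"


# Two stand-ins for model outputs: one correct, one buggy.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
rewards = [verifiable_reward(good), verifiable_reward(bad)]
```

The point is that the reward comes from running the code, not from a human rating it, so you can generate as many of these pass/fail signals as you have tasks with tests.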

And if I were to guess, the latest generation of models (Claude Opus 4.6, GPT-5.3-codex, etc.) differs from Opus 4.5 and GPT 5.2 primarily in the addition of deeper, more difficult tasks to their RLVR training, most likely agentic and coding-based ones like Terminal Bench.

I could be completely off, as my intuition here is fully based on public research papers, but it seems to explain the current state of things fairly well.