by segmondy
128 days ago
RLVR: Reinforcement Learning with Verifiable Rewards. Before this it was RLHF, reinforcement learning from human feedback. Models can now be trained on coding problems without a human in the loop: you give the model a problem to solve, and you have a means of verifying the answer. Think of it like a unit test. The model writes the code; if it fails, it gets a fail, and if it passes, it gets a pass. Do enough of this and the model really learns to code on its own, or to operate better as an agent. That's the main thing that has changed between last year and this year.

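To make the "think of it like a unit test" part concrete, here is a rough sketch of what a pass/fail reward could look like. The names (`verifiable_reward`, `add`, the test cases) are made up for illustration and are not from any particular paper or library. A real pipeline would presumably run candidates in a sandbox and feed the reward into an RL update step, neither of which is shown here.

```python
def verifiable_reward(candidate_source: str, func_name: str,
                      test_cases: list[tuple[tuple, object]]) -> float:
    """Return 1.0 if the candidate code passes every test case, else 0.0.

    Hypothetical helper: the reward comes from running the code, not from
    a human rating it.
    """
    namespace: dict = {}
    try:
        # Assumption: candidates are trusted enough to exec directly.
        # This is unsandboxed and for illustration only.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0  # wrong answer: fail
    except Exception:
        return 0.0  # syntax error, crash, or missing function: fail
    return 1.0  # every check passed: pass


# Two made-up "model outputs" for the same problem: add two numbers.
tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

rewards = [verifiable_reward(src, "add", tests) for src in (good, bad)]
print(rewards)  # [1.0, 0.0]
```

The point is just that the signal is binary and computed automatically, so you can generate as many of these as you have problems with checkable answers.
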
I could be completely off, as my intuition here is fully based on public research papers, but it seems to explain the current state of things fairly well.