Hacker News
by aspenmartin 1 hour ago
Yeah, it's rejection sampling. You have an agent, you take a verifiable problem (people use lots of different verification signals, but say unit tests), and you have the agent attempt it K times. You accept the trajectories (the entire log: all context, tool use, etc.) that are positively verified and use those as training examples.
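
To make the loop concrete, here's a rough sketch of what I mean. Everything named here (`Trajectory`, `rejection_sample`, `toy_agent`, `toy_verify`) is made up for illustration and not from any particular framework; the toy agent just "succeeds" on even-numbered attempts so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container for one full attempt."""
    problem_id: str
    log: List[str]   # entire log: context, tool calls, outputs
    answer: str


def rejection_sample(
    problem_id: str,
    run_agent: Callable[[str, int], Trajectory],
    verify: Callable[[Trajectory], bool],
    k: int,
) -> List[Trajectory]:
    """Attempt the problem k times and keep only verified trajectories."""
    accepted = []
    for attempt in range(k):
        traj = run_agent(problem_id, attempt)
        if verify(traj):            # e.g. unit tests pass
            accepted.append(traj)   # becomes a training example
    return accepted


# Toy stand-ins (assumptions): a real agent would call a model and tools,
# and a real verifier would run unit tests or another checker.
def toy_agent(problem_id: str, attempt: int) -> Trajectory:
    answer = "4" if attempt % 2 == 0 else "5"
    log = [f"problem: {problem_id}", f"attempt: {attempt}", f"answer: {answer}"]
    return Trajectory(problem_id, log, answer)


def toy_verify(traj: Trajectory) -> bool:
    return traj.answer == "4"


accepted = rejection_sample("two_plus_two", toy_agent, toy_verify, k=8)
print(len(accepted), "of 8 trajectories accepted")
```

The rejected trajectories are simply dropped here; only the accepted ones would go into the training set.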

The trick is to find the examples that sit right between too difficult and too easy for the existing agent; these give the strongest training signal.
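
One simple way to approximate "in between" is to look at the empirical pass rate over the K attempts and keep problems that are neither always solved nor never solved. This is only a sketch under my own assumptions: the `low`/`high` thresholds of 0.2 and 0.8 are arbitrary, and `select_informative` is a hypothetical helper, not something from a real library.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of the attempts that were positively verified."""
    return sum(outcomes) / len(outcomes)


def select_informative(
    results: Dict[str, List[bool]],
    low: float = 0.2,    # assumed cutoff: below this is "too difficult"
    high: float = 0.8,   # assumed cutoff: above this is "too easy"
) -> List[str]:
    """Keep problem ids whose pass rate falls in the middle band."""
    return [
        pid
        for pid, outcomes in results.items()
        if outcomes and low <= pass_rate(outcomes) <= high
    ]


# Made-up verification outcomes for K = 8 attempts per problem.
results = {
    "too_easy": [True] * 8,
    "too_hard": [False] * 8,
    "borderline": [True, False, True, False, False, True, False, True],
}

selected = select_informative(results)
print(selected)
```

With these made-up outcomes only `borderline` (pass rate 0.5) survives the filter; the always-pass and never-pass problems are skipped.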