by cubefox 202 days ago
I assume they use self-verification only during RL training to provide the reward signal, but not for benchmarks.