Hacker News
by gwern 3428 days ago
The averaging part makes it sound like the usual RL self-play against regular checkpoints of oneself.