Hacker News
by jacobr1 725 days ago
One wrinkle is that it is now common to fine-tune on previously derived RL datasets, using the tested inputs and preferred sample outputs as the training data.
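
To make that concrete, here is a rough sketch of what I mean, assuming the RL dataset is stored as preference records. The field names (`prompt`, `chosen`, `rejected`), the helper `preferences_to_sft`, and the toy rows are all made up for illustration and not taken from any particular dataset or library.

```python
import json


def preferences_to_sft(records):
    """Turn preference records into plain supervised fine-tuning pairs.

    Each input record is assumed (hypothetically) to hold a tested input
    under "prompt", the preferred output under "chosen", and the
    dispreferred output under "rejected". Only the preferred output is kept.
    """
    return [
        {"prompt": r["prompt"], "completion": r["chosen"]}
        for r in records
    ]


# Toy preference data, invented for this example.
preference_data = [
    {"prompt": "2 + 2 =", "chosen": "4", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Lyon"},
]

sft_data = preferences_to_sft(preference_data)

# One JSON object per line, a common layout for fine-tuning files.
for row in sft_data:
    print(json.dumps(row))
```

The point is that the reward signal is gone by this stage: the fine-tune only sees input/output pairs, with the preference baked into which output was kept.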