by Bayano2 57 days ago
This was true circa GPT-2, less true after RLHF, and not true at all after RLVR. The model is now trying to model the distribution of outputs most likely to solve the problem, not the average distribution.
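
A toy sketch of that distinction, not a description of how any particular lab trains its models. It assumes the textbook KL-regularized RL objective, whose optimum reweights the base distribution as p*(y) ∝ p0(y) · exp(r(y) / β). The answer labels, probabilities, reward values, and the `tilt` helper are all made up for illustration.

```python
import math

def tilt(base, reward, beta):
    """Reweight a base distribution toward high-reward outputs.

    Implements p*(y) proportional to base(y) * exp(reward(y) / beta),
    the closed-form optimum of a KL-regularized reward objective.
    """
    weights = {y: p * math.exp(reward[y] / beta) for y, p in base.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical pretrained ("average") distribution: the most common
# continuation is wrong, and the correct one is rare.
base = {
    "common wrong answer": 0.6,
    "other wrong answer": 0.3,
    "correct answer": 0.1,
}

# Hypothetical verifiable reward: 1 if the answer checks out, else 0.
reward = {
    "common wrong answer": 0.0,
    "other wrong answer": 0.0,
    "correct answer": 1.0,
}

tuned = tilt(base, reward, beta=0.1)

print("base mode: ", max(base, key=base.get))
print("tuned mode:", max(tuned, key=tuned.get))
```

With a small β the reweighted distribution puts nearly all of its mass on the answer that earns the reward, even though the base distribution rarely produced it. A large β leaves the result close to the base distribution.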