by zaptrem 423 days ago
I think when they were figuring out RLHF, they avoided this by interleaving RLHF updates with ordinary cross-entropy gradient steps on the training set.
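For anyone who wants to see the shape of that idea, here is a rough toy sketch, not what any lab actually ran. It assumes a three-token softmax "policy", uses plain REINFORCE as a stand-in for the RLHF update, and alternates it with a cross-entropy step on sampled training tokens. The names `train`, `ce_every`, and `reward_fn` are made up for illustration.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    # Gradient of -log p[target] w.r.t. the logits: p - onehot(target).
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def reinforce_grad(logits, action, reward):
    # Gradient of -reward * log p[action] w.r.t. the logits.
    # Assumption: plain REINFORCE stands in for the real RLHF update.
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(probs)]

def train(steps, data, reward_fn, lr=0.1, ce_every=2, seed=0):
    # Hypothetical schedule: every `ce_every`-th step is a cross-entropy step
    # on a training token; the rest are reward steps. ce_every=0 means RL only.
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for step in range(steps):
        if ce_every and step % ce_every == 0:
            target = rng.choice(data)
            grad = cross_entropy_grad(logits, target)
        else:
            probs = softmax(logits)
            action = rng.choices(range(len(logits)), weights=probs)[0]
            grad = reinforce_grad(logits, action, reward_fn(action))
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return softmax(logits)

if __name__ == "__main__":
    data = [0, 1, 2]  # toy "training set": uniform over the three tokens
    reward_fn = lambda a: 1.0 if a == 0 else 0.0  # toy reward favoring token 0
    print("RL only:    ", train(500, data, reward_fn, ce_every=0))
    print("interleaved:", train(500, data, reward_fn, ce_every=2))
```

The intent is that the cross-entropy steps keep pulling the distribution back toward the training data while the reward steps push toward whatever the reward favors. How the two are mixed in practice (ratio, loss weighting, which data) is not something this sketch claims to know.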