Hacker News
by 7777777phil 63 days ago
KL(P||Q) penalizes Q heavily when it assigns low probability to things P considers likely, but barely cares when Q wastes probability on rare events. That's why KL regularization in RLHF pushes models toward typical, average-sounding outputs.
1 comment

I think you probably meant this, but when used with RL it's usually KL(π || π_ref), which gives a high loss when the in-training policy π produces output that's unlikely under the reference. But yeah, as you noted, I guess this also means there is no direct penalty if π _does not_ produce output that π_ref would, which leads to a form of mode collapse.
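
To make the asymmetry concrete, here's a rough sketch on a toy 3-outcome distribution. The numbers and the names `pi_ref`, `pi_collapsed`, and `pi_off` are made up for illustration, and real RLHF setups estimate the KL per token from samples rather than summing over a full distribution, so treat this as intuition only.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical reference policy: two equally likely modes plus one rare output.
pi_ref = [0.495, 0.495, 0.01]
# Hypothetical trained policy that has collapsed onto the first mode.
pi_collapsed = [0.98, 0.01, 0.01]
# Hypothetical trained policy that puts a lot of mass on the rare output.
pi_off = [0.30, 0.30, 0.40]

# KL(pi || pi_ref), the direction usually used as the RL penalty.
reverse_collapsed = kl(pi_collapsed, pi_ref)
reverse_off = kl(pi_off, pi_ref)
# KL(pi_ref || pi), the direction described in the parent comment.
forward_collapsed = kl(pi_ref, pi_collapsed)

print(f"KL(pi_collapsed || pi_ref) = {reverse_collapsed:.3f}")
print(f"KL(pi_off       || pi_ref) = {reverse_off:.3f}")
print(f"KL(pi_ref || pi_collapsed) = {forward_collapsed:.3f}")
```

In this toy case, dropping a mode costs less under KL(π || π_ref) than moving mass onto the rare output, and the cost of collapsing onto one mode can never exceed log(1 / π_ref(mode)). The other direction, KL(π_ref || π), charges more for the same collapse. So the penalty isn't literally zero, but it's small and bounded, which fits the mode-collapse story.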

This collapse in variety matches what I've seen some studies show: "sloppification" is not present in the base model and is only introduced during the RL phase.