by 7777777phil
63 days ago
KL(P||Q) penalizes Q heavily when it assigns low probability to things P considers likely, but barely cares when Q wastes probability on events P considers rare. That's why KL regularization in RLHF pushes models toward typical, average-sounding outputs.
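To make the asymmetry concrete, here is a rough sketch in plain Python. The three-outcome distributions are made up purely for illustration (they are not from any model or paper), and `kl` is just a helper name I picked. It compares a Q that misses something P considers likely against a Q that dumps mass on P's rare outcome.

```python
import math

def kl(p, q):
    """KL(P||Q) in nats for two discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing, by the usual 0 * log(0) = 0 convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only to illustrate the asymmetry.
p = [0.50, 0.49, 0.01]              # P: two likely outcomes, one rare outcome

q_misses_mode = [0.98, 0.01, 0.01]  # Q gives low probability to an outcome P finds likely
q_wastes_mass = [0.35, 0.35, 0.30]  # Q spends a lot of probability on P's rare outcome

kl_miss = kl(p, q_misses_mode)
kl_waste = kl(p, q_wastes_mass)

print(f"Q misses a likely outcome: KL(P||Q) = {kl_miss:.3f} nats")
print(f"Q wastes mass on rare one: KL(P||Q) = {kl_waste:.3f} nats")
```

With these numbers the first divergence comes out several times larger than the second, even though the second Q moves 0.29 of its mass onto an outcome P almost never produces. Swapping the argument order gives a different penalty profile, so which distribution plays P matters.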
This collapse in variety matches what I've seen: some studies show that "sloppification" is not present in the base model and is only introduced during the RL phase.