| HN Mirror

	by krackers 60 days ago
	I think you probably meant this, but when used with RL it's usually KL(π || π_ref), which is large when the in-training policy π puts probability on output that's unlikely under the reference. But yeah, as you noted, I guess this also means there is little to no penalty if π _does not_ produce output that π_ref would, which leads to a form of mode collapse. This collapse in variety matches what I've seen in some studies, which show that "sloppification" is not present in the base model and is only introduced during the RL phase.
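
To make the asymmetry concrete, here is a minimal sketch on a toy 4-outcome distribution. All the numbers and names (`pi_ref`, `collapsed`, `off_support`) are made up for illustration and are not from any real model or training setup; it only compares the reverse KL(π || π_ref) against the forward KL(π_ref || π).

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a term with p_i = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Hypothetical reference policy: two likely outputs, two unlikely ones.
pi_ref = [0.49, 0.49, 0.01, 0.01]

# Policy that dropped one of the reference's two modes (mode collapse).
collapsed = [0.97, 0.01, 0.01, 0.01]

# Policy that moved its mass onto an output the reference finds unlikely.
off_support = [0.01, 0.01, 0.97, 0.01]

reverse_collapsed = kl(collapsed, pi_ref)    # KL(pi || pi_ref), the usual RL penalty
reverse_off = kl(off_support, pi_ref)        # KL(pi || pi_ref)
forward_collapsed = kl(pi_ref, collapsed)    # KL(pi_ref || pi), for comparison

print(f"reverse KL, collapsed policy:   {reverse_collapsed:.3f}")
print(f"reverse KL, off-support policy: {reverse_off:.3f}")
print(f"forward KL, collapsed policy:   {forward_collapsed:.3f}")
```

In this toy case the reverse KL charges the collapsed policy much less than the off-support one, and less than the forward KL would charge the same collapse. The penalty for dropping a mode is small but not exactly zero.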