|
So it's not really hallucinating: the model correctly represents "seahorse emoji" internally, but that concept has no corresponding token. lm_head just picks the closest token it does have, and the model doesn't realize the mismatch until it's too late, i.e. once the wrong token is already in its context. That would explain why RL helps. Base models never see their own outputs during training, so they can't learn "this concept exists but I can't actually say it." |
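
To make the "picks the closest thing" step concrete, here's a rough toy sketch. Everything in it is made up for illustration: the 3-d vectors, the three-token vocabulary, and the `lm_head` helper are hypothetical, not real model weights or a real library API. It also assumes greedy decoding and treats "closest" as "highest dot-product logit".

```python
# Toy illustration only: hypothetical 3-d embeddings, not real model weights.

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unembedding rows: one vector per token the vocabulary can emit.
# Note there is deliberately no "seahorse_emoji" entry.
unembed = {
    "fish_emoji": [0.9, 0.1, 0.8],
    "horse_emoji": [0.1, 0.9, 0.8],
    "the": [0.0, 0.0, -1.0],
}

# Hypothetical hidden state for the concept "seahorse emoji":
# partly sea-like, partly horse-like, strongly emoji-like.
hidden = [0.7, 0.6, 0.9]

def lm_head(h):
    """Return (token, logit) pairs sorted by logit, highest first."""
    logits = {tok: dot(h, vec) for tok, vec in unembed.items()}
    return sorted(logits.items(), key=lambda kv: kv[1], reverse=True)

ranked = lm_head(hidden)
best_token = ranked[0][0]  # greedy decoding: take the top-scoring token
print(ranked)
print("emitted:", best_token)
```

The hidden state here is a perfectly reasonable representation of the concept, but the output layer can only score tokens that exist, so the best it can do is a neighbor. Nothing in this step tells the model that the winner was a poor match.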