by big-chungus4 30 days ago
They don't use the old model; they use an EMA (exponential moving average) of the trained model's weights as the teacher.
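
For anyone unfamiliar with EMA teachers, here is a rough sketch of the update. The function name `ema_update`, the flat list of weights, and the decay values are illustrative assumptions, not taken from the work being discussed.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    Hypothetical helper: weights are plain lists of floats for illustration.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy example: the teacher drifts slowly toward the student after each step.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher is a smoothed, lagging copy of the student rather than a frozen old checkpoint.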

Yeah, and they are not self-distilling in the sense of training on the outputs of the model, but instead operating on the pre-output probability distribution, trying to minimize the differences between teacher and student.
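
A minimal sketch of what "minimizing teacher-student differences" on the distribution could look like. It assumes a KL divergence from teacher to student over softmax probabilities; the actual divergence and its direction are not stated here, and the function names are made up for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(teacher_logits, student_logits):
    """Assumed loss: KL(teacher || student) over next-token probabilities."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# The loss is zero when the distributions match and positive otherwise.
matching = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
differing = distill_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

The point is that the student is matched against the full distribution over tokens, not against a single sampled output.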

Still, at the end of the day this is a type of SFT, not continual learning in the human/animal sense.