by tomp 297 days ago

So why is "distilling from N-gram" better? Why does it make the transformer learn English faster?

Hypothesis: it's the standard "teacher-student" or "distillation" trick. With plain next-token prediction, each training example only tells you what the correct answer is (i.e. a single spike in probability), but when you're distilling from a teacher model, you learn the entire distribution over potential answers.
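
To make the hypothesis concrete, here is a rough sketch of the training signal for a single position under each setup. Everything in it is made up for illustration: the four-word vocabulary, the teacher probabilities (standing in for an N-gram model's output), and the untrained student that starts out uniform. It relies on the standard result that, for a softmax output trained with cross-entropy, the gradient with respect to the logits is (predicted - target).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(target, predicted):
    # Cross-entropy + softmax: d(loss)/d(logit_i) = predicted_i - target_i.
    return [p - t for p, t in zip(predicted, target)]

# Hypothetical toy vocabulary; the true next token is "cat".
vocab = ["cat", "dog", "car", "the"]

# Untrained student: all logits equal, so it predicts uniformly.
student = softmax([0.0, 0.0, 0.0, 0.0])

# Plain next-token prediction: the target is a one-hot spike.
one_hot = [1.0, 0.0, 0.0, 0.0]

# Distillation: the target is a full distribution (invented numbers,
# standing in for what an N-gram teacher might assign).
teacher = [0.6, 0.3, 0.05, 0.05]

hard_grad = grad_wrt_logits(one_hot, student)
soft_grad = grad_wrt_logits(teacher, student)

for word, h, s in zip(vocab, hard_grad, soft_grad):
    print(f"{word:>4}  hard: {h:+.2f}  soft: {s:+.2f}")
```

A negative gradient means gradient descent raises that logit. With the one-hot target, every wrong token gets the same push down, so "dog" and "the" look equally wrong. With the teacher's distribution, "dog" is nudged up while "car" and "the" are pushed down, so one example carries information about which alternatives are plausible. Whether that is what actually speeds up learning in the N-gram case is still just the hypothesis above.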

Curious, can anyone more experienced in AI research comment on this?