by tomp
250 days ago
So why is "distilling from N-gram" better? Why does it make the transformer learn English faster? My hypothesis: it's the standard "teacher-student" (distillation) trick. With plain next-token prediction, each training example only tells you what the correct answer was (a single spike in probability), but when you distill from a teacher model, you learn the teacher's entire distribution over possible next tokens. Can anyone more experienced in AI research comment on this?
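To make the hypothesis concrete, here is a toy sketch in plain Python. The three-word vocabulary and the teacher probabilities are made up for illustration (they are not from any real n-gram model), and the student is assumed to be a softmax over logits trained with cross-entropy. It compares the gradient the student receives from a one-hot label with the gradient it receives from a soft teacher distribution.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    # H(target, predicted) = -sum_i target_i * log(predicted_i)
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def grad_wrt_logits(target, predicted):
    # For softmax followed by cross-entropy, when the target sums to 1,
    # d(loss)/d(logit_i) = predicted_i - target_i.
    return [p - t for t, p in zip(target, predicted)]

# Hypothetical vocabulary and numbers, chosen only for illustration.
vocab = ["cat", "dog", "car"]
student = softmax([0.0, 0.0, 0.0])   # untrained student: uniform prediction

one_hot = [1.0, 0.0, 0.0]            # hard label: the observed next token was "cat"
teacher = [0.6, 0.3, 0.1]            # assumed n-gram teacher distribution

hard_grad = grad_wrt_logits(one_hot, student)
soft_grad = grad_wrt_logits(teacher, student)

for word, h, s in zip(vocab, hard_grad, soft_grad):
    print(f"{word}: hard-label grad {h:+.3f}, teacher grad {s:+.3f}")
```

With the hard label, "dog" and "car" get exactly the same gradient, so this single example says nothing about which wrong answer is more plausible. With the teacher distribution, "car" is pushed down much harder than "dog", so one example already carries information about the whole distribution. Whether that fully explains the speedup is still just a guess.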