	by PrayagBhakar 888 days ago
	If we’re talking about model distillation [0], I don’t think the student can ever be better than the teacher: optimising for speed and a smaller model size inherently means some loss of precision. Even if the student is as big as the teacher, some information is still lost in the transfer.

	[0] https://arxiv.org/pdf/2210.17332.pdf
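
	To make the point concrete, here is a minimal sketch of the objective a student typically minimises, assuming the classic soft-target formulation (KL divergence between temperature-softened teacher and student outputs). That formulation is my assumption, not necessarily what [0] covers, and the names `softmax` and `distillation_loss` are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper: the T^2 factor is the usual convention for keeping
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# The loss is zero only when the student reproduces the teacher's distribution.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

	This term bottoms out at zero when the student matches the teacher's output distribution, so on its own it rewards imitation and nothing more. In practice it is often mixed with a loss on the ground-truth labels, which this sketch leaves out.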