


	by ankit219 793 days ago
	Not trying to disparage them, but their models always give the impression of being overfitted to benchmarks, which is why they score so well. On everyday tasks such as chat or simple completion, they are much worse. Distillation can work, and there are papers suggesting it does, but we still do not have a reliable mechanism for distilling knowledge from larger teacher models into smaller student models.

1 comment

This was the case for Phi-2; it was notoriously rubbish in practical use.