by ankit219 793 days ago
Not trying to disparage them, but their models always give the impression of being overfitted on benchmarks, which is why they score so well. On everyday tasks like chat or simple completion, they're much worse.

Distillation can work, and there are papers suggesting it does, but we still don't have a reliable mechanism for distilling knowledge from larger teacher models into smaller student models.

1 comment

This was the case for Phi-2; it was notoriously rubbish in practical use.