
I think it depends on what downstream task you're trying to do. DeepMind tried distilling big language models into smaller ones (think 7B -> 1B), but it didn't work too well: the distilled model lost a lot of quality (for general language modeling) relative to the original model.

See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...

But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
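
For anyone unfamiliar with what "distilling" means here, below is a rough sketch of the classic soft-label distillation loss: the student is trained to match the teacher's temperature-softened output distribution. This is the textbook formulation, not necessarily what the DeepMind paper did; the function names, the default temperature, and the toy logits are all made up for illustration.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; the default of 2.0 is an arbitrary choice.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.5]
print(distillation_loss(teacher, student))
```

In practice this term is usually mixed with the ordinary cross-entropy loss on the ground-truth labels. For a classification task the logits are over a handful of classes rather than a full vocabulary, which is part of why the student has a much easier target to match.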