by ml_hardware
1546 days ago
I think it depends on what downstream task you're trying to do. DeepMind tried distilling big language models into smaller ones (think 7B -> 1B), but it didn't work too well: the distilled model lost a lot of quality (for general language modeling) relative to the original model. See Figure A28 in the paper: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93... But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
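For anyone unfamiliar with what "distilling" means here, below is a minimal sketch of the commonly used soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. This is an illustration only, not necessarily the setup used in the DeepMind paper; the function names, the example logits, and the default temperature of 2.0 are all assumptions of mine.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual scaling that keeps gradient
    magnitudes comparable across temperatures. Hypothetical helper,
    written for illustration.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    # Made-up logits for a 3-token vocabulary.
    teacher = [1.0, 2.0, 3.0]
    student = [3.0, 2.0, 1.0]
    print(distillation_loss(teacher, student))
```

In practice this term is usually mixed with the ordinary cross-entropy loss on the true labels, and the loss is zero only when the student reproduces the teacher's distribution exactly.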