by dongobread
719 days ago
The knowledge distillation is very interesting, but generating trillions of outputs from a large teacher model seems insanely expensive. Is this really more cost-efficient than just spending that compute on training your model with more data or more epochs?