|
|
|
|
|
by lr1970
492 days ago
|
|
> Distilling means fine-tuning an existing model using outputs from the bigger model. Crucially, the output of the teacher model includes token probabilities so that the fine-tuning is trying to learn the entire output distribution. |
|