by dcre
490 days ago
Distilling means fine-tuning an existing (usually smaller) model on outputs from a bigger model. The real technique is in the details: what you choose to generate from the bigger model, how long to train for, and a bunch of other nitty-gritty stuff I don't know about because I'm also not an ML engineer. Google it!
Crucially, in the classic setup the teacher's output includes the token probabilities, not just the sampled text, so the fine-tuning is trying to learn the teacher's entire output distribution.
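
For anyone who wants to see what "learning the entire output distribution" could look like, here is a rough sketch in plain Python for a single token position. Everything in it is my own illustration and not any particular lab's recipe: the function names (`softmax`, `distillation_loss`, `hard_label_loss`) are made up, the logits are toy numbers, and the `temperature` knob is an assumption borrowed from common practice. The point is only the contrast between matching the teacher's full probability vector and matching its single top token.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities (temperature > 1 flattens them)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary at one token position.

    Hypothetical helper: the student is penalized for every token where
    its probability differs from the teacher's, not just the top one.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(teacher_logits, student_logits):
    """Cross-entropy against only the teacher's most likely token.

    This is what you get when training on generated text alone.
    """
    q = softmax(student_logits)
    top = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(q[top])

# Toy 3-token vocabulary (made-up numbers).
teacher = [2.0, 1.0, 0.1]
student = [0.1, 1.0, 2.0]
print("soft-target loss:", distillation_loss(teacher, student))
print("hard-label loss: ", hard_label_loss(teacher, student))
```

The soft-target loss is zero only when the student's whole distribution matches the teacher's, which is a much richer training signal than the hard-label version that only cares about one token per position.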