by imurray
717 days ago
The paper (https://arxiv.org/abs/2402.14905) says they tried that. Deep link to the relevant snippet in the HTML version:
https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5 "So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."
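
For anyone unfamiliar with the two objectives being compared in that quote, here is a rough sketch of the difference. This is a generic textbook-style illustration, not the paper's actual setup: the function names, the toy logits, and the temperature parameter are all my own assumptions. Hard-label training scores the student against the single observed next token, while KD scores it against the teacher's whole output distribution, which presumably is where the extra cost comes from, since every batch also needs a teacher forward pass.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, target_index):
    """Cross-entropy against the one observed next token (hard label)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper for illustration; real KD recipes often mix
    this term with the hard-label loss and rescale by temperature**2.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # toy logits over a 3-token vocabulary
    teacher = [1.5, 1.0, -2.0]   # toy teacher logits (made up)
    print("hard-label loss:", hard_label_loss(student, target_index=0))
    print("KD loss:        ", kd_loss(student, teacher, temperature=2.0))
```

The KD term is zero when the student already matches the teacher and positive otherwise, so it carries more signal per token than a one-hot target, but per the quote that did not translate into better accuracy for these compact models.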