"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."
Deep link to the relevant snippet in html version: https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5
"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."