Hacker News new | ask | show | jobs
by pastescreenshot 94 days ago
The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe.
1 comments

oh ensemble can be distilled to a single model easily.
How?
Same way you distill any model. Training data efficiency matters only while you train the source model/ensemble. Once you have that you are purely compute bound during distillation.