| HN Mirror



	by pastescreenshot 94 days ago
	The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe.


sdpmas 94 days ago

Oh, an ensemble can be distilled into a single model easily.


SknCode 93 days ago

How?


sigmoid10 93 days ago

Same way you distill any model. Training data efficiency matters only while you train the source model/ensemble. Once you have that, you are purely compute-bound during distillation.
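
To make that concrete, here is a rough sketch of ensemble distillation in NumPy. It is a toy under stated assumptions, not the recipe from the linked work: the teachers are random linear softmax classifiers, the student is a single linear softmax model, and the names `ensemble_probs`, `distill`, and `cross_entropy` are made up for illustration. The point it shows is that the student only needs inputs plus the ensemble's averaged predictions, so no extra labeled data is consumed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_probs(teachers, x):
    # Average the teachers' predictive distributions (the soft targets).
    return np.mean([softmax(x @ w) for w in teachers], axis=0)

def cross_entropy(targets, probs):
    # Mean cross-entropy between soft targets and student probabilities.
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

def distill(teachers, x, steps=500, lr=0.5):
    # Fit one linear student to the ensemble's soft targets with plain
    # gradient descent on cross-entropy. Only inputs are needed here.
    targets = ensemble_probs(teachers, x)
    student = np.zeros_like(teachers[0])
    for _ in range(steps):
        grad = x.T @ (softmax(x @ student) - targets) / len(x)
        student -= lr * grad
    return student

rng = np.random.default_rng(0)
# Hypothetical toy ensemble: 5 linear classifiers, 4 features, 3 classes.
teachers = [rng.normal(size=(4, 3)) for _ in range(5)]
x = rng.normal(size=(2000, 4))  # unlabeled inputs, no ground-truth labels

targets = ensemble_probs(teachers, x)
loss_before = cross_entropy(targets, softmax(x @ np.zeros((4, 3))))
student = distill(teachers, x)
loss_after = cross_entropy(targets, softmax(x @ student))

# Fraction of inputs where the student's top class matches the ensemble's.
agreement = float(np.mean(
    softmax(x @ student).argmax(axis=1) == targets.argmax(axis=1)
))
print(loss_before, loss_after, agreement)
```

In this toy the cost of distillation is the forward passes through the ensemble plus the student's training steps, which is the "compute-bound" part. How much of the ensemble's gain survives in a real single model depends on student capacity and is not something this sketch can show.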
