|
|
|
|
|
by pastescreenshot
94 days ago
|
|
The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe. |
|