HN Mirror


	by archon 5 hours ago
	I'm uneducated on how distillation works at more than a basic level, so forgive me if this is a stupid question. Isn't "distillation" of another provider's model exactly how these models got their training data in the first place: massive amounts of the written word + Prompt -> Answer? Why wouldn't distillation produce similar "reasoning" in the new model? It's just inputs and outputs.

2 comments

maxbond 4 hours ago

What you're describing is (pre-)training. Distillation requires richer labels: the probability distribution over tokens (strictly it would be logits rather than probabilities, but that's not important here). From a chat transcript you can only recover the argmax, i.e. the most likely token of that distribution (and only if the API allows you to set the temperature to 0). It's not impossible for an API to expose the full distribution, but providers won't if they don't want you distilling their models.

The intuition is that distillation exploits not only the "right" answer but the relationship between answers (what's the second most right answer? the third? etc).
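
To make that intuition concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the four-token vocabulary, the logit values, and the helper names `softmax`, `hard_label_loss`, and `soft_label_loss` are hypothetical, and the soft loss is a plain KL divergence on temperature-softened distributions (without the temperature-squared scaling some setups add). It compares two students that both pick the teacher's top token but disagree about the runner-up.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_logits):
    """Cross-entropy against only the teacher's argmax token,
    i.e. what a temperature-0 transcript would reveal."""
    target = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return -math.log(softmax(student_logits)[target])

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the whole softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical logits over a 4-token vocabulary (illustrative values only).
teacher   = [4.0, 3.5, 0.5, -2.0]
student_a = [4.0, 3.4, 0.4, -2.0]  # same top token, same runner-up as the teacher
student_b = [4.0, -2.0, 0.4, 3.4]  # same top token, but the wrong runner-up

for name, student in (("student_a", student_a), ("student_b", student_b)):
    print(name,
          "hard:", round(hard_label_loss(student, teacher), 4),
          "soft:", round(soft_label_loss(student, teacher), 4))
```

Under these assumptions the hard-label loss cannot tell the two students apart, because both put the same probability on the top token, while the soft-label loss penalizes `student_b` for ranking the remaining tokens differently from the teacher.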

zozbot234 5 hours ago

Among other things, because you simply can't get those "massive amounts" of text from a SOTA model at reasonable cost. And complex reasoning cannot possibly be trained in a pure one-shot fashion; real post-training takes massive resources. The whole story doesn't add up.