HN Mirror

by onlyrealcuzzo 28 days ago

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

3 comments

Philpax 28 days ago

It wouldn't be data distillation; it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give the student more capability than training on the data itself would.
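A minimal sketch of the teacher-student idea, assuming the common formulation in which the student is trained to match the teacher's temperature-softened output distribution via KL divergence. The function names, logits, and temperature here are illustrative assumptions, not anything a specific lab is known to use.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) over softened distributions for one token position."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 1.0]
print(distillation_loss(teacher, student))
```

The soft targets carry more signal per example than a one-hot label from the raw data does, since they also say how plausible the teacher considers every other token.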

semiquaver 28 days ago

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data.

coldtea 28 days ago

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation from SOTA labs whose models depend on explicitly disregarding the intellectual property of millions of works in their training data.

flossly 27 days ago

> nefarious Chinese copycats

LLMs are themselves copycats.

I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)

manmal 28 days ago

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

wtallis 27 days ago

Raw training data is raw. A really big model trained on it has already done a first pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for a marginal quality improvement over distilling from the large model.

adgjlsfhk1 27 days ago

Distilling from a larger model is not only probably cheaper than training from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they were trained on.

semiquaver 19 days ago

This guy or gal or person gets it.

supern0va 28 days ago

I think you replied to the wrong parent.

minimaltom 28 days ago

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architecture side, I'm a lot more interested in attention residuals than anything else. It's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.

onlyrealcuzzo 28 days ago

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better, which just weren't known when they started development of the previous models.

amluto 28 days ago

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?
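For context on the question, here is a toy sketch of the basic draft-then-verify loop, assuming the simple greedy variant in which a drafted token is accepted only if it matches the target model's own greedy choice. The `Model` type, the toy models, and `k` are hypothetical stand-ins; a real system verifies all drafted positions in one batched forward pass of the target, and that verification compute is what competes with serving other users in a batched setting.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model maps a token context to its greedy next token.
Model = Callable[[Sequence[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal in order (one batched pass in practice).
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target's pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10  # counts upward modulo 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10  # wrong whenever the last token is 2


tokens, rounds = generate(toy_target, toy_draft, [0], n_new=12)
print(tokens, rounds)
```

The output is identical to plain greedy decoding with the target alone; the only thing that changes is how many target passes are needed, which is why the benefit shows up as per-session latency rather than as aggregate throughput.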

onlyrealcuzzo 28 days ago

It's useful at the local level, where there will be SOTA models developed...

zozbot234 28 days ago

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).

MTP will still be highly valuable for interactive use of course.
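The Amdahl's law point above can be made concrete with a small calculation. This is only a sketch: the 20% prefill share is an assumed figure for illustration, not a measurement from the patchset, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(prefill_fraction: float, decode_speedup: float) -> float:
    """Amdahl's law: prefill is the unaccelerated part, decode is sped up.

    prefill_fraction is the share of baseline wall-clock time spent in prefill.
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be between 0 and 1")
    if decode_speedup <= 0.0:
        raise ValueError("decode_speedup must be positive")
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


# Assumed: prefill takes 20% of baseline wall-clock time.
print(round(overall_speedup(0.2, 2.0), 2))  # 2x decode -> 1.67x overall
print(round(overall_speedup(0.2, 3.0), 2))  # 3x decode -> 2.14x overall
```

As the decode speedup grows, the overall gain approaches `1 / prefill_fraction`, so under this assumed 20% share it could never exceed 5x.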
