| HN Mirror


	by xg15 309 days ago
	Wouldn't this imply that most of the inference time storage and compute might be unnecessary? If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?

5 comments

janalsncm 308 days ago

Quick example: Kimi K2 is a recent large mixture-of-experts model. Each “expert” is really just a path within it. For each token, 32B out of 1T parameters are active, so only 3.2% of the weights are used for any one token.
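
To make the arithmetic and the "path" idea concrete, here is a minimal sketch of generic top-k expert routing. The function names and the toy logits are made up for illustration, and this is not a claim about how Kimi K2 itself routes tokens; only the 32B and 1T figures come from the comment above.

```python
import math


def active_fraction(active_params, total_params):
    """Share of parameters touched for a single token."""
    return active_params / total_params


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and give them softmax weights.

    Hypothetical helper: a generic top-k gate, not any specific model's router.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Figures quoted in the comment: 32B active out of 1T total.
print(f"{active_fraction(32e9, 1e12):.1%}")  # 3.2%

# Toy router scores for 4 experts; only 2 are run for this token.
print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

The experts that are not selected are simply never evaluated for that token, which is where the compute saving comes from. All of them still have to be stored, though.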

Sophira 308 days ago

That sounds surprisingly like "Humans only use 10% of their brain at any given time."

paulsutter 308 days ago

That’s exactly how it works. Read up on pruning: you can drop most of the weights and still get great results. One catch is that sparse matrices are vastly less efficient to multiply on hardware built for dense math.

But yes, you’ve got it.
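
For anyone who hasn't seen pruning before, the simplest variant is magnitude pruning: zero out the weights with the smallest absolute values. A minimal sketch on a flat list of weights follows; `magnitude_prune` is a hypothetical helper, not a library function, and real pipelines usually prune per layer and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of the weights.

    Illustrative only: operates on a flat Python list rather than tensors.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


print(magnitude_prune([0.5, -0.01, 2.0, 0.02], sparsity=0.5))
# [0.5, 0.0, 2.0, 0.0]
```

Note that the result is the same size as the input. The zeros only save storage or compute if the format and the kernels can actually skip them, which is the efficiency catch mentioned above.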

tough 309 days ago

Someone on Twitter was exploring this and linked to some related papers where you can, for example, trim experts from a MoE model if you're 100% sure they're never active for your specific task.

What the bigger, wider net buys you is generalization.
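
A rough sketch of how one might find trim candidates: log which experts the router picks on task data and list the ones that never show up. The trace format and the `unused_experts` helper are assumptions made for illustration, not taken from those papers.

```python
from collections import Counter


def unused_experts(routing_trace, num_experts):
    """Return the ids of experts that never appear in the routing trace.

    `routing_trace` is assumed to be a list with one entry per token,
    each entry being the list of expert ids selected for that token.
    """
    counts = Counter(e for token_experts in routing_trace for e in token_experts)
    return [e for e in range(num_experts) if counts[e] == 0]


# Toy trace: 3 tokens, 2 experts chosen per token, 8 experts in total.
trace = [[0, 3], [3, 5], [0, 5]]
print(unused_experts(trace, num_experts=8))  # [1, 2, 4, 6, 7]
```

A trace like this only shows that an expert was unused on the sampled inputs. It doesn't get you to "100% sure", so anything outside that sample is where the lost generalization would show up.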

markeroon 309 days ago

Look into pruning

FuckButtons 308 days ago

For any particular single pattern learned, 99% of the weights are dead weight. But it’s not the same 99% for each lesson learned.
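
A back-of-the-envelope way to see this, under a strong simplifying assumption that each lesson uses an independent random 1% of the weights (real networks certainly overlap more than that): the share of weights used by at least one of n lessons is 1 - 0.99^n.

```python
def fraction_used(per_lesson_fraction, n_lessons):
    """Expected share of weights used by at least one lesson.

    Assumes every lesson picks its weights independently at random,
    which is an idealization and not a measured property of any model.
    """
    return 1.0 - (1.0 - per_lesson_fraction) ** n_lessons


for n in (1, 100, 300):
    print(n, round(fraction_used(0.01, n), 3))
# 1 0.01
# 100 0.634
# 300 0.951
```

Under that toy model, a few hundred lessons are enough for nearly every weight to matter for something, even though each individual lesson ignores 99% of them.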