| HN Mirror



	by agunapal 45 days ago
	If you really think about why MoE came into existence, it's to save significant cost during training. I don't think there was any concrete evidence of performance gains for comparable MoE vs dense models. Over the years, I believe all the new techniques being employed in post-training have made the models better.

2 comments

vessenes 45 days ago

I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection, so that a given token didn’t need to go through ‘unnecessary’ parts of the model. This then let inference use memory more efficiently in memory-constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage.
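To make the "cold experts in slow RAM" idea concrete, here is a minimal sketch of an expert cache with least-recently-used eviction. Everything in it is an assumption for illustration: `ExpertCache`, the `load_from_slow_tier` callback, and the string "weights" are hypothetical stand-ins, not any real inference engine's API.

```python
from collections import OrderedDict


class ExpertCache:
    """Keep at most `capacity` experts in fast memory; evict the least recently used."""

    def __init__(self, capacity, load_from_slow_tier):
        self.capacity = capacity
        self.load = load_from_slow_tier  # hypothetical loader: expert_id -> weights
        self.fast = OrderedDict()        # expert_id -> weights, oldest first
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.fast:
            self.fast.move_to_end(expert_id)  # mark as most recently used
            return self.fast[expert_id]
        self.misses += 1
        weights = self.load(expert_id)        # slow path: fetch from RAM or disk
        self.fast[expert_id] = weights
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)     # drop the least recently used expert
        return weights


# Hypothetical access pattern: expert 0 is "hot", experts 1 and 2 are less common.
cache = ExpertCache(capacity=2, load_from_slow_tier=lambda e: f"weights-{e}")
for expert_id in [0, 1, 0, 2, 0, 1]:
    cache.get(expert_id)
print(cache.misses, list(cache.fast))
```

The hot expert stays resident while the rarely used ones take the slow path, which is the trade that makes a small fast tier workable.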

But I don’t think it necessarily saved training cost; if it did, I’d be interested to learn how!


bjourne 45 days ago

Each token is only routed through a few chosen (top-k) experts during training. So not all expert weights are updated in the backward pass. OTOH, you may need more training to ensure all experts see enough tokens!
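A rough sketch of that routing step, assuming a router that emits one score per expert for each token. The function names, batch size, and random scores are made up for illustration; real implementations add load balancing and capacity limits on top.

```python
import random


def top_k_route(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return ranked[:k]


def tokens_per_expert(batch_scores, k):
    """Count how many tokens in the batch each expert receives."""
    counts = [0] * len(batch_scores[0])
    for scores in batch_scores:
        for expert in top_k_route(scores, k):
            counts[expert] += 1
    return counts


# Hypothetical batch: 4 tokens, 8 experts, random router scores.
rng = random.Random(0)
batch = [[rng.random() for _ in range(8)] for _ in range(4)]
counts = tokens_per_expert(batch, k=2)
print(counts)
# An expert with a count of 0 processed no token in this batch,
# so its weights receive no gradient from this batch.
```

The counts always sum to tokens times k, and an uneven spread across experts is exactly the "some experts see too few tokens" problem.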

I doubt MoE is actually worth it, given how complicated high-performance expert routing and training are. But who knows, I don't.


agunapal 44 days ago

Here is a paper from a few years ago where they report up to a 7x increase in pre-training speed, which equates to savings.

https://arxiv.org/abs/2101.03961


zozbot234 45 days ago

MoE models will have far more world knowledge than dense models with the same number of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput (not total memory footprint), or alternatively if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
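The total vs. active distinction is simple arithmetic. Here is a back-of-the-envelope sketch, assuming a simplified layout where `shared_params` covers everything every token uses (attention, embeddings) and `expert_params` is the size of one expert summed over all MoE layers. The sizes are hypothetical, not taken from any real model.

```python
def moe_param_counts(shared_params, expert_params, n_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE layout."""
    total = shared_params + n_experts * expert_params   # must be stored somewhere
    active = shared_params + top_k * expert_params      # used to process one token
    return total, active


# Hypothetical sizes chosen only to make the ratio visible.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    expert_params=500_000_000,
    n_experts=64,
    top_k=4,
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B")
# prints "total=34B active=4B"
```

Per-token compute and weight reads track the active count, while the memory footprint tracks the total.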


regularfry 45 days ago

Yes, this. I can run the 122B Qwen3.5 MoE usably on one 4090 + 64GB RAM. That's a monster of a model, comparatively speaking.


aitchnyu 45 days ago

Tangential. I'm a newb, can you name the concept of partitioning weights so we don't need to load the whole thing?


agunapal 44 days ago

Do you mean model sharding?
