	by bjourne 47 days ago
	Each token is routed through only a few chosen (top-k) experts during training, so not all expert weights are updated in the backward pass. On the other hand, you may need more training to ensure all experts see enough tokens! I doubt MoE is actually worth it, given how complicated high-performance expert routing and training are. But who knows, I don't.
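
	To make the "each expert only sees some tokens" point concrete, here is a rough sketch in plain Python. It is not how any real MoE framework does routing: the router scores are just random numbers, and `topk_route` and `count_expert_tokens` are names made up for this illustration. It only counts how many tokens each expert would receive (and therefore get gradient from) under top-k routing.

```python
import random

def topk_route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def count_expert_tokens(num_tokens, num_experts, k, seed=0):
    """Route random tokens and count how many tokens each expert receives."""
    rng = random.Random(seed)
    counts = [0] * num_experts
    for _ in range(num_tokens):
        # Assumption: stand-in for a learned router, one random score per expert.
        scores = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        for e in topk_route(scores, k):
            # Only these k experts would be updated by this token's backward pass.
            counts[e] += 1
    return counts

counts = count_expert_tokens(num_tokens=1000, num_experts=8, k=2)
print(counts)
```

	With a perfectly balanced router, each expert would see about k/num_experts of the tokens (here 2/8, so roughly 250 of 1000), which is the "you may need more training" part. A learned router can be far less balanced than this random one.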