| HN Mirror



	by rughouse 808 days ago
	It’s very similar to Mixture of Experts. But instead of routing tokens to multiple experts, you "deploy to a single expert which can be dynamically skipped"
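	A rough sketch of the "single expert which can be dynamically skipped" idea, in plain Python. This is only an illustration under my own assumptions, not the paper's code: the names `mod_block`, `router_weights`, `block`, and `capacity` are hypothetical, the router is a plain dot product, and scaling the block output by the router score is a simplification.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def mod_block(
    tokens: List[Vector],
    router_weights: Vector,
    block: Callable[[Vector], Vector],
    capacity: int,
) -> Tuple[List[Vector], List[int]]:
    """Route only the `capacity` highest-scoring tokens through `block`.

    All names here are hypothetical; this is a toy, not a real layer.
    """
    # Router: one scalar score per token (assumed to be a dot product).
    scores = [sum(w * x for w, x in zip(router_weights, t)) for t in tokens]

    # Pick the indices of the top-`capacity` tokens by score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:capacity])

    out: List[Vector] = []
    for i, t in enumerate(tokens):
        if i in chosen:
            # Selected token: residual plus score-weighted block output.
            update = block(t)
            out.append([x + scores[i] * u for x, u in zip(t, update)])
        else:
            # Skipped token: passes through unchanged (residual path only).
            out.append(list(t))
    return out, sorted(chosen)
```

	With `capacity` smaller than the sequence length, the expensive `block` runs on only a subset of tokens, which is where the compute saving would come from.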

1 comment

Mixing these would be pretty cool: even less compute for MoE while keeping the performance.

The paper already shows a combination of the two, called Mixture-of-Depths-and-Experts (MoDE).