| HN Mirror



	by vessenes 76 days ago
	I’ve been doing some low-key testing on smaller models, and it looks to me like it’s possible to train an MoE model with characteristics that are helpful for streaming. For instance, you could add a loss term that penalizes expert swapping both within a single forward pass and across multiple forward passes. So I believe there is a place for thinking about this on the model training side.
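
	To make the idea concrete, here is a rough sketch of one possible swap penalty across consecutive tokens. It is not what I actually trained with, just an illustration. The name swap_penalty and the layout router_probs[t][l] (routing probabilities over experts for token t at layer l) are made up for this example, and a real version would live in an autodiff framework and be added to the task loss with some weight.

```python
def swap_penalty(router_probs):
    """Average routing change between consecutive tokens, per layer.

    router_probs[t][l] is a list of routing probabilities over experts
    for token t at layer l (hypothetical layout; each list sums to 1).
    The per-layer change is 1 - overlap, i.e. the total variation
    distance between the two routing distributions.
    """
    total = 0.0
    count = 0
    for prev, curr in zip(router_probs, router_probs[1:]):
        for p_layer, c_layer in zip(prev, curr):
            overlap = sum(min(p, c) for p, c in zip(p_layer, c_layer))
            total += 1.0 - overlap
            count += 1
    return total / count if count else 0.0


# Three tokens, two layers, two experts per layer, identical routing.
stable = [[[0.9, 0.1], [0.2, 0.8]] for _ in range(3)]
# Two tokens, one layer, routing flips completely between tokens.
flipping = [[[1.0, 0.0]], [[0.0, 1.0]]]

print(swap_penalty(stable))
print(swap_penalty(flipping))
```

	Identical routing across tokens gives a penalty of 0, and a complete flip gives 1, so the term only pushes on how often the router changes its mind, not on which experts it picks.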

1 comments

zozbot234 76 days ago

Penalizing expert swaps doesn't seem like it would help much, because experts vary by layer and are picked layer-wise. There's no guarantee that expert X in layer Y that was used for the previous token will still be in memory when this token needs to load from layer Y. The optimum would also vary depending on how much memory you have at any given moment. It's not obviously worth optimizing for.
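
A toy simulation of the memory point, as a sketch only: it assumes top-1 routing and an LRU cache of expert weights keyed by (layer, expert), neither of which is a claim about any particular runtime. The name hit_rate and the layout routes[t][l] are hypothetical.

```python
from collections import OrderedDict


def hit_rate(routes, capacity):
    """Fraction of expert loads served from an LRU cache.

    routes[t][l] is the expert id picked for token t at layer l
    (top-1 routing assumed). The cache holds at most `capacity`
    expert weight blocks, keyed by (layer, expert).
    """
    cache = OrderedDict()
    hits = 0
    total = 0
    for token_route in routes:
        for layer, expert in enumerate(token_route):
            key = (layer, expert)
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Perfectly stable routing: every token uses the same expert in each of 4 layers.
routes = [[0, 1, 0, 1] for _ in range(3)]

print(hit_rate(routes, capacity=4))  # room for one expert per layer
print(hit_rate(routes, capacity=2))  # room for only two experts in total
```

With room for one expert per layer, every load after the first token is a hit. With room for only two, each expert is evicted before the next token reaches its layer, so nothing is ever reused even though routing never changes.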


vessenes 76 days ago

Right. You need to predict a set of experts through the entire forward pass. Think of a vertical strip.
