HN Mirror



	by vlovich123 115 days ago
	MoE is not suited to paging because each token is routed to what is essentially a random expert. MoE only improves throughput because it lowers the memory bandwidth needed to generate a token: only 1/n of the weights are read per token (but a different 1/n on each iteration). Shrinking the weights, sure, but I’ve seen nothing that indicates you can page weights in and out without cratering performance, just as paging would with a non-MoE model.
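
To put rough numbers on the bandwidth point above, here is a minimal back-of-the-envelope sketch. Every figure in it (64 GB of expert weights, 8 experts, 1 active expert per token, 400 GB/s of memory bandwidth, 8 GB/s for paging) is a made-up assumption for illustration, not a measurement, and the function names are hypothetical. It ignores compute, shared layers, and the KV cache, and treats weight reads as the only bottleneck.

```python
GB = 10**9

def bytes_read_per_token(expert_bytes_total, num_experts, active_experts):
    """Weight bytes read to generate one token (expert weights only)."""
    per_expert = expert_bytes_total / num_experts
    return per_expert * active_experts

def max_tokens_per_second(bandwidth_bytes_per_s, bytes_per_token):
    """Upper bound on tokens/s if reading weights is the only bottleneck."""
    return bandwidth_bytes_per_s / bytes_per_token

# Assumed, illustrative figures (not from any real model or machine).
EXPERT_BYTES = 64 * GB   # total size of all expert weights
NUM_EXPERTS = 8
FAST_BW = 400 * GB       # weights resident in fast memory
PAGING_BW = 8 * GB       # weights paged in over a slow link

# Dense-equivalent: every weight is read for every token.
dense = max_tokens_per_second(
    FAST_BW, bytes_read_per_token(EXPERT_BYTES, NUM_EXPERTS, NUM_EXPERTS))

# MoE with 1 of 8 experts active and all weights resident.
moe = max_tokens_per_second(
    FAST_BW, bytes_read_per_token(EXPERT_BYTES, NUM_EXPERTS, 1))

# Worst case for paging: the active expert is never resident, so it
# has to come over the slow link for every token.
paged = max_tokens_per_second(
    PAGING_BW, bytes_read_per_token(EXPERT_BYTES, NUM_EXPERTS, 1))

print(dense, moe, paged)
```

Under these assumptions the MoE model reads 1/8 of the bytes per token, so the bandwidth-bound ceiling is 8x the dense one, while the every-token-misses paging case is limited by the slow link instead. How close real workloads come to that worst case depends on how often the needed expert is already resident.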


FuckButtons 115 days ago

Not entirely true. It’s random access within the relevant subset of experts, and since concepts are clustered, you have a much higher probability of hitting the same subset of experts repeatedly.
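
Whether paging is viable comes down to how often the next expert is already resident. A minimal sketch of that question: an LRU cache of expert slots fed by two synthetic routing streams. Both streams are assumptions, not data from any real model. `uniform_routing` picks an expert uniformly at random per token; `clustered_routing` is a hypothetical generator in which tokens mostly stay inside a small "hot" subset that changes every `segment_len` tokens. All names and parameters here are illustrative.

```python
import random
from collections import OrderedDict

def lru_hit_rate(expert_stream, cache_slots):
    """Fraction of expert lookups served from an LRU cache of resident experts."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for expert in expert_stream:
        total += 1
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)       # mark as most recently used
        else:
            cache[expert] = True            # "page in" the expert
            if len(cache) > cache_slots:
                cache.popitem(last=False)   # evict the least recently used
    return hits / total if total else 0.0

def uniform_routing(num_experts, num_tokens, rng):
    """One expert per token, chosen uniformly at random."""
    return [rng.randrange(num_experts) for _ in range(num_tokens)]

def clustered_routing(num_experts, num_tokens, hot_set_size,
                      stay_prob, segment_len, rng):
    """Hypothetical locality model: with probability stay_prob the token
    goes to the current hot subset, which is redrawn every segment_len tokens."""
    hot = rng.sample(range(num_experts), hot_set_size)
    stream = []
    for t in range(num_tokens):
        if t > 0 and t % segment_len == 0:
            hot = rng.sample(range(num_experts), hot_set_size)
        if rng.random() < stay_prob:
            stream.append(rng.choice(hot))
        else:
            stream.append(rng.randrange(num_experts))
    return stream

if __name__ == "__main__":
    rng = random.Random(0)
    uniform = uniform_routing(64, 20000, rng)
    clustered = clustered_routing(64, 20000, 8, 0.9, 500, rng)
    print("uniform  :", lru_hit_rate(uniform, cache_slots=16))
    print("clustered:", lru_hit_rate(clustered, cache_slots=16))
```

With uniform routing the hit rate sits near the fraction of experts that fit in the cache, so most tokens trigger a page-in. With the clustered stream it is much higher. The sketch only shows that the answer depends on routing locality; it says nothing about which stream real models resemble, which would have to be measured from actual router traces.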


vlovich123 114 days ago

It’s called mixture of experts, but concepts don’t map cleanly, or even roughly, to different experts. Otherwise you wouldn’t get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments where different GPUs each load an expert. There you specifically want tokens spread across the experts at random to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.


cagenut 114 days ago

is there anywhere good to read/follow to get operational clarity on this stuff?

my current system of looking for 1 in 1000 posts on HN or 1 in 100 on r/LocalLLaMA is tedious.


p1esk 114 days ago

Ask any of the models to explain this to you
