| HN Mirror



	by schipperai 47 days ago
	With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?

1 comments

ac29 47 days ago

Modern GPUs aren't optimized for MoEs though?

The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it fits on fewer GPUs. The tradeoff is that it is much slower: it has to read 100% of its weights for every token, whereas MoE models typically read only about a tenth (though sparsity levels vary).
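To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not a spec of any real model or GPU: the parameter counts, the 1 byte per parameter, the 80 GB per GPU, and the 3000 GB/s memory bandwidth are all made up. It also assumes single-stream decoding is limited purely by how fast weights can be read, ignoring KV cache, compute, and interconnect.

```python
import math

def weight_gb(total_params_b, bytes_per_param=1):
    """Memory needed to hold all weights, in GB (params given in billions)."""
    return total_params_b * bytes_per_param

def gpus_needed(total_params_b, gpu_mem_gb=80, bytes_per_param=1):
    """Minimum GPU count just to fit the weights (hypothetical 80 GB cards)."""
    return math.ceil(weight_gb(total_params_b, bytes_per_param) / gpu_mem_gb)

def decode_tokens_per_sec(active_params_b, bandwidth_gb_s=3000, bytes_per_param=1):
    """Upper bound on tokens/s if each token must read every active weight once."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Hypothetical models: a 100B dense model vs. a 400B MoE with ~1/10 of weights active.
dense = {"total_b": 100, "active_b": 100}
moe = {"total_b": 400, "active_b": 40}

for name, m in (("dense", dense), ("moe", moe)):
    print(
        name,
        "gpus:", gpus_needed(m["total_b"]),
        "tok/s bound:", decode_tokens_per_sec(m["active_b"]),
    )
```

With these made-up figures the dense model needs fewer GPUs to hold its weights, while the MoE has the higher bandwidth-bound decode rate because it touches fewer weights per token. Real deployments will differ, since batching, quantization, and expert routing all shift the picture.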


schipperai 47 days ago

Thanks, makes sense. I meant Blackwell is explicitly optimized for MoEs.
