by schipperai 47 days ago
With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?

Modern GPUs aren't optimized for MoEs though?

The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token, whereas MoE models typically read only about a tenth (though sparsity levels vary).
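
A rough back-of-envelope sketch of that tradeoff, assuming single-stream decoding that is purely memory-bandwidth bound (ignoring KV cache, batching, and compute). Every number here is a made-up illustration, not a spec of any real model or GPU: the 100B dense and 400B MoE sizes, the 10% active fraction, 2 bytes per parameter, and the 2 TB/s bandwidth are all assumptions.

```python
# Back-of-envelope: weight footprint vs. bytes read per generated token.
# All sizes and the bandwidth figure are hypothetical placeholders.

def weight_footprint_bytes(total_params: int, bytes_per_param: int) -> int:
    """Memory needed just to hold the weights (decides how many GPUs you need)."""
    return total_params * bytes_per_param

def bytes_read_per_token(active_params: int, bytes_per_param: int) -> int:
    """Weights that must be streamed from memory to produce one token."""
    return active_params * bytes_per_param

def max_tokens_per_second(bandwidth_bytes_per_s: float, bytes_per_token: int) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_per_token

BYTES_PER_PARAM = 2              # assumption: 16-bit weights
BANDWIDTH = 2e12                 # assumption: 2 TB/s aggregate memory bandwidth

DENSE_TOTAL = 100_000_000_000    # hypothetical dense model: 100B params
DENSE_ACTIVE = DENSE_TOTAL       # dense: every weight is read for every token

MOE_TOTAL = 400_000_000_000      # hypothetical MoE of similar quality: 400B params
MOE_ACTIVE = MOE_TOTAL // 10     # assumption: about a tenth active per token

dense_footprint = weight_footprint_bytes(DENSE_TOTAL, BYTES_PER_PARAM)
moe_footprint = weight_footprint_bytes(MOE_TOTAL, BYTES_PER_PARAM)

dense_tps = max_tokens_per_second(
    BANDWIDTH, bytes_read_per_token(DENSE_ACTIVE, BYTES_PER_PARAM))
moe_tps = max_tokens_per_second(
    BANDWIDTH, bytes_read_per_token(MOE_ACTIVE, BYTES_PER_PARAM))

print(f"dense: {dense_footprint / 1e9:.0f} GB of weights, <= {dense_tps:.1f} tok/s")
print(f"moe:   {moe_footprint / 1e9:.0f} GB of weights, <= {moe_tps:.1f} tok/s")
```

Under these assumptions the dense model needs a quarter of the memory but streams 2.5x more bytes per token, which is the "fewer GPUs, but slower" tradeoff. Real numbers depend heavily on the actual sparsity, quantization, and batch size.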

Thanks, makes sense. I meant Blackwell is explicitly optimized for MoEs.