| HN Mirror



	by omneity 310 days ago
	Qwen3 32B is a dense model: it uses all of its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one. In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
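To illustrate what "only uses a fraction at a time" means, here is a minimal sketch of top-k expert routing for a single token. The expert count, the router scores, and the parameter sizes are made-up toy values, not the real configuration of GPT OSS or Qwen; `route_token`, `total_params_b`, and `active_params_b` are hypothetical helper names.

```python
def route_token(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:k])


def total_params_b(shared_b, per_expert_b, n_experts):
    """All weights that must be stored: shared layers plus every expert."""
    return shared_b + n_experts * per_expert_b


def active_params_b(shared_b, per_expert_b, k):
    """Weights actually used for one token: shared layers plus k experts."""
    return shared_b + k * per_expert_b


# Toy example (assumed numbers): 8 experts, 2 active per token.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
print(route_token(scores, k=2))          # [1, 3]
print(total_params_b(1.0, 0.5, 8))       # 5.0 (billions, stored)
print(active_params_b(1.0, 0.5, 2))      # 2.0 (billions, used per token)
```

A dense model is the special case where every weight is used for every token, so total and active counts are the same.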

2 comments

bee_rider 309 days ago

Tangential question from an outsider:

When people talk about sparse or dense models, are they sparse or dense matrices in the conventional numerical linear algebra sense? (Something like a CSR matrix?)


selcuka 310 days ago

> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.

They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.


Mars008 310 days ago

You call it fair? 32 / 5.1 > 6, so it takes 6 times more compute per token. Put another way, Qwen3 32B is 6 times slower than GPT OSS 120B.
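The arithmetic behind that ratio, as a rough sketch. It assumes per-token compute scales linearly with the number of active parameters and ignores attention cost, memory bandwidth, and whether the weights fit on the device; `active_compute_ratio` is a hypothetical helper name.

```python
def active_compute_ratio(dense_params_b, moe_active_params_b):
    """Ratio of parameters touched per token: dense model vs. MoE model."""
    return dense_params_b / moe_active_params_b


# Qwen3 32B (dense) vs. GPT OSS 120B (5.1B active per token)
ratio = active_compute_ratio(32.0, 5.1)
print(f"{ratio:.2f}")  # 6.27
```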


kgeist 310 days ago

>Qwen3 32B is 6 times slower than GPT OSS 120B.

Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that only has 32 GB VRAM, gpt-oss 120B is actually 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec)


selcuka 309 days ago

We are talking about accuracy, though. I don't see the point of MoE if a 120B MoE model is not as accurate as even a 32B model.


littlestymaar 309 days ago

I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.

In the case of gpt-oss 120B that would mean sqrt(5*120) ≈ 24B.
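A quick sketch of that rule of thumb, using the parameter counts quoted in this thread. The geometric mean is only a heuristic, and `dense_equivalent_b` is a hypothetical helper name.

```python
import math


def dense_equivalent_b(total_params_b, active_params_b):
    """Geometric mean of total and active parameter counts (in billions)."""
    return math.sqrt(total_params_b * active_params_b)


print(round(dense_equivalent_b(120, 5), 1))    # 24.5
print(round(dense_equivalent_b(120, 5.1), 1))  # 24.7
print(round(dense_equivalent_b(20, 3.6), 1))   # 8.5
```

The last line is consistent with the "dense ~8B" comparison suggested for GPT OSS 20B at the top of the thread.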


Mars008 309 days ago

Not sure there is one formula, because there are two different cases:

1) Performance constrained, like NVIDIA Spark with 128GB or AGX with 64GB.

2) Memory constrained, like consumer GPUs.

In the first case MoE is a clear win: the models fit and run faster. In the second case dense models will produce better results, and if the token/sec performance is acceptable, they are the better choice.
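A back-of-the-envelope way to tell the two cases apart is to check whether the weights fit in device memory at all. This is only a sketch: the 4 bits per weight is an assumed quantization level chosen for illustration, KV cache and activations are ignored, and `weight_footprint_gb` and `fits_in_memory` are hypothetical helper names.

```python
def weight_footprint_gb(total_params_b, bits_per_weight):
    """Approximate weight size in GB: billions of params * bits / 8."""
    return total_params_b * bits_per_weight / 8


def fits_in_memory(total_params_b, bits_per_weight, memory_gb):
    """True if the weights alone fit in the given memory budget."""
    return weight_footprint_gb(total_params_b, bits_per_weight) <= memory_gb


BITS = 4  # assumed quantization, not a measured value

for name, params_b in [("120B MoE", 120), ("32B dense", 32)]:
    size = weight_footprint_gb(params_b, BITS)
    print(name, size,
          "fits in 32 GB:", fits_in_memory(params_b, BITS, 32),
          "fits in 128 GB:", fits_in_memory(params_b, BITS, 128))
```

Under these assumptions the 120B model needs about 60 GB, so it fits on a 128 GB device but not on a 32 GB consumer GPU, while the 32B dense model (about 16 GB) fits on both.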


selcuka 309 days ago

> In the case of gpt-oss 120B that would mean sqrt(5*120) ≈ 24B.

That's actually in line with what I had (unscientifically) expected. Claude Sonnet 4 seems to agree:

> The most accurate approach for your specific 120B MoE (5.1B active) would be to test it empirically against dense models in the 10-30B range.


kgeist 309 days ago

I've read that the formula is based on the early Mistral models and does not necessarily reflect what's going on nowadays.
