by breput 4 hours ago
> We'll assume a 32B dense model, as they've gotten quite good for production use and a B200 can comfortably serve them. This could be a Gemma, Qwen, DeepSeek, whatever.

That seems like a very consequential point to include halfway through the post. They aren't wrong that Qwen 3.6 26B or Gemma 4 31B are quite good, depending on the use case, but if we're doing napkin math, I'd want some more headroom in the assumptions.

They really ought to have Qwen parameterize their post's calculations and add sliders so a reader could play around with the values.

Edit: And since they specifically mentioned DeepSeek (or whatever): as far as I know, none of their current generation of models is dense, and even the smallest of the mixture-of-experts (MoE) models is 284B parameters (13B activated). That will completely incinerate their napkin.

1 comments

Yes, 32B dense is a weird one to choose.

But in practice, 32B dense is very similar* to an MoE with 32B activated in terms of inference costs. And I highly suspect e.g. Opus is around that level of active params.

At scale, a 284B-A13B model is almost certainly cheaper to serve than a 32B dense model.

*Since you can shard the model across multiple GPUs at scale, though in reality you lose some efficiency to GPU coordination and expert routing.
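
For anyone who wants to poke at the numbers, here is a rough Python sketch of that comparison. It is napkin math under stated assumptions, not anyone's real serving model: it uses the common rule of thumb of about 2 FLOPs per active parameter per generated token, assumes 1 byte per parameter (FP8 weights), and assumes 192 GB of HBM per GPU. The `Model` class and the function names are made up for illustration. It ignores KV cache, batching, and the routing and coordination overhead mentioned above.

```python
import math
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions


def decode_gflops_per_token(m: Model) -> float:
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    # Attention over the KV cache is ignored.
    return 2.0 * m.active_params_b


def weight_gb(m: Model, bytes_per_param: float) -> float:
    # Billions of parameters times bytes per parameter gives gigabytes.
    return m.total_params_b * bytes_per_param


def min_gpus_for_weights(m: Model, bytes_per_param: float, hbm_gb: float) -> int:
    # Lower bound only: counts weights, leaves no room for KV cache.
    return math.ceil(weight_gb(m, bytes_per_param) / hbm_gb)


BYTES_PER_PARAM = 1.0  # assumption: FP8 weights
HBM_GB = 192.0         # assumption: per-GPU memory

dense = Model("32B dense", total_params_b=32, active_params_b=32)
moe = Model("284B-A13B MoE", total_params_b=284, active_params_b=13)

for m in (dense, moe):
    print(
        f"{m.name}: {decode_gflops_per_token(m):.0f} GFLOPs/token, "
        f"{weight_gb(m, BYTES_PER_PARAM):.0f} GB weights, "
        f">= {min_gpus_for_weights(m, BYTES_PER_PARAM, HBM_GB)} GPU(s)"
    )
```

Under these assumptions the MoE needs less than half the compute per token but roughly nine times the weight memory, which is why it only looks cheaper once you are already sharding across several GPUs.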

That's good information. I couldn't even begin to run DeepSeek Flash on my system, but if you're assuming multiple GPUs, that is also going to affect the napkin math.