| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by coder543 79 days ago

For the many DGX Spark and Strix Halo users with 128GB of memory, I believe the ideal model size would probably be a MoE with close to 200B total parameters and a low active count of 3B to 10B.

I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.

It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.

The common 120B size these days leaves a lot of unused memory on the table on these machines.

I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!

2 comments

redman25 79 days ago

200a10b please, 200a3b is too little active to have good intelligence IMO and 10b is still reasonably fast.

link

suprjami 79 days ago

Following the current rule of thumb MoE = `sqrt(param*active)` a 200B-A3B would have the intelligence of a ~24B dense model.

That seems pointless. You can achieve that with a single 24G graphics card already.

I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better building a desktop computer with two 3090s.

link

coder543 79 days ago

That rule of thumb was invented years ago, and I don’t think it is relevant anymore, despite how frequently it is quoted on Reddit. It is certainly not the "current" rule of thumb.

For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:

- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4 = 273 / 1.5 = 182 tokens per second as the theoretical maximum.

- (RTX 3090) 936GB/s with 24B parameters at Q4 = 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8 to maximize the memory usage on the 24GB card.

The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. All else being equal, I would take 182 tokens per second over 78 any day of the week. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.

(The above calculation is dramatically oversimplified, but the end result holds, even if the absolute numbers would probably be less for both scenarios. Token generation is fundamentally bandwidth limited with current autoregressive models. Diffusion LLMs could change that.)

The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.

As I said, I wanted to see what it is possible for Google to achieve.

> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. But, I agree those two models are quite close, and that's why I want to see greater sparsity and greater total parameters: to push the limits and see what happens, for science.

link

zozbot234 79 days ago

Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.

link

coder543 79 days ago

Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although it is applied to the total set of experts, not just routing to fewer experts each time), and it can be okay for very specific use cases if you measure which experts are needed for your use case and carefully choose to delete the least used ones, but it doesn't really provide any general insight into how a higher sparsity model would behave if trained that way from scratch.

link

zozbot234 79 days ago

Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to GPU and running the MoE part on CPU, because the bandwidth of PCIe transfers to a discrete GPU is a killer bottleneck. Platforms with reasonable amounts of unified memory are more balanced despite the lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth: you'd be better off then with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).

link

girvo 79 days ago

The value prop for the Nvidia one is simple: playing with CUDA with wide enough RAM at okay enough speeds, then running your actual workload on a server someone running the same (not really, lol Blackwell does not mean Blackwell…) architecture.

They’re fine tuning and teaching boxes, not inference boxes. IMO anyway, that’s what mine is for.

link