
by canyon289 76 days ago

We are always figuring out what parameter size makes sense.

The decision is always a mix between how good we can make the models from a technical aspect and how good they need to be to make all of you super excited to use them. And it's a bit of a challenge in what is an ever-changing ecosystem.

I'm personally curious: is there a certain parameter size you're looking for?

9 comments

coder543 76 days ago

For the many DGX Spark and Strix Halo users with 128GB of memory, I believe the ideal model size would probably be a MoE with close to 200B total parameters and a low active count of 3B to 10B.

I would personally love to see a super sparse 200B A3B model, just to see what is possible. These machines don't have a lot of bandwidth, so a low active count is essential to getting good speed, and a high total parameter count gives the model greater capability and knowledge.

It would also be essential to have the Q4 QAT, of course. Then the 200B model weights would take up ~100GB of memory, not including the context.

The common 120B size these days leaves a lot of unused memory on the table on these machines.
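As a rough check on the sizes above, here is a minimal sketch of the weight-memory arithmetic. It assumes exactly 4 bits per weight and ignores quantization scales, any tensors kept at higher precision, and the context (KV cache); `weight_gb` is a hypothetical helper name, not part of any library.

```python
def weight_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    bytes_total = total_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare the common 120B size with the proposed 200B size on a 128 GB machine.
for size_b in (120, 200):
    used = weight_gb(size_b, 4)
    print(f"{size_b}B at 4-bit: ~{used:.0f} GB of weights, ~{128 - used:.0f} GB left over")
```

Under these assumptions the 120B model leaves roughly half of the 128GB unused, which is the gap a ~200B model would fill.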

I would also like the larger models to support audio input, not just the E2B/E4B models. And audio output would be great too!

redman25 76 days ago

200a10b please, 200a3b is too little active to have good intelligence IMO and 10b is still reasonably fast.

suprjami 76 days ago

Following the current rule of thumb MoE = `sqrt(param*active)` a 200B-A3B would have the intelligence of a ~24B dense model.
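For reference, a minimal sketch of that rule of thumb as arithmetic. It is only the heuristic quoted above (the geometric mean of total and active parameters), not a measured result, and `dense_equivalent_b` is a hypothetical name used here for illustration.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)


print(round(dense_equivalent_b(200, 3), 1))   # 200B-A3B  -> 24.5
print(round(dense_equivalent_b(200, 10), 1))  # 200B-A10B -> 44.7
print(round(dense_equivalent_b(122, 10), 1))  # 122B-A10B -> 34.9
```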

That seems pointless. You can achieve that with a single 24G graphics card already.

I wonder if it would even hold up at that level, as 3B active is really not a lot to work with. Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

I don't see any value proposition for these little boxes like DGX Spark and Strix Halo. Lots of too-slow RAM to do anything useful except run mergekit. imo you'd have been better building a desktop computer with two 3090s.

coder543 76 days ago

That rule of thumb was invented years ago, and I don’t think it is relevant anymore, despite how frequently it is quoted on Reddit. It is certainly not the "current" rule of thumb.

For the sake of argument, even if we take that old rule of thumb at face value, you can see how the MoE still wins:

- (DGX Spark) 273GB/s of memory bandwidth with 3B active parameters at Q4 = 273 / 1.5 = 182 tokens per second as the theoretical maximum.

- (RTX 3090) 936GB/s with 24B parameters at Q4 = 936 / 12 = 78 tokens per second. Or 39 tokens per second if you wanted to run at Q8 to maximize the memory usage on the 24GB card.

The "slow" DGX Spark is now more than twice as fast as the RTX 3090, thanks to an appropriate MoE architecture. Even with two RTX 3090s, you would still be slower. All else being equal, I would take 182 tokens per second over 78 any day of the week. Yes, an RTX 5090 would close that gap significantly, but you mentioned RTX 3090s, and I also have an RTX 3090-based AI desktop.
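The arithmetic behind those numbers can be sketched as follows. This assumes the same simplification as the bullets above: every active weight is read from memory exactly once per generated token, and nothing else (KV cache reads, compute, overhead) costs any time. `max_tokens_per_s` is a hypothetical helper name.

```python
def max_tokens_per_s(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Theoretical ceiling on decode speed for a purely bandwidth-bound model."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


print(f"DGX Spark, 3B active, Q4: {max_tokens_per_s(273, 3, 4):.0f} tok/s")
print(f"RTX 3090, 24B dense, Q4:  {max_tokens_per_s(936, 24, 4):.0f} tok/s")
print(f"RTX 3090, 24B dense, Q8:  {max_tokens_per_s(936, 24, 8):.0f} tok/s")
```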

(The above calculation is dramatically oversimplified, but the end result holds, even if the absolute numbers would probably be less for both scenarios. Token generation is fundamentally bandwidth limited with current autoregressive models. Diffusion LLMs could change that.)

The mid-size frontier models are rumored to be extremely sparse like that, but 10x larger on both total and active. No one has ever released an open model that sparse for us to try out.

As I said, I wanted to see what it is possible for Google to achieve.

> Qwen 3.5 uses 122B-A10B and still is neck and neck with the 27B dense model.

From what I've seen, having used both, I would anecdotally report that the 122B model is better in ways that aren't reflected in benchmarks, with more inherent knowledge and more adaptability. But, I agree those two models are quite close, and that's why I want to see greater sparsity and greater total parameters: to push the limits and see what happens, for science.

zozbot234 76 days ago

Kimi 2.5 is relatively sparse at 1T/32B; GLM 5 does 744B/40B so only slightly denser. Maybe you could try reducing active expert count on those to artificially increase sparsity, but I'm sure that would impact quality.
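To put "relatively sparse" in numbers, here is a small sketch that computes the active fraction for the sizes quoted in this thread. The parameter counts are the ones stated above (with 1T taken as 1000B), and the 200B-A3B entry is the hypothetical model proposed earlier, not a released one.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token (lower means sparser)."""
    return active_b / total_b


# (total, active) in billions of parameters, as quoted in the thread.
models = {
    "Kimi 2.5 (1T/32B)": (1000, 32),
    "GLM 5 (744B/40B)": (744, 40),
    "Qwen 3.5 (122B-A10B)": (122, 10),
    "hypothetical 200B-A3B": (200, 3),
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} active")
```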

coder543 76 days ago

Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although it is applied to the total set of experts, not just routing to fewer experts each time), and it can be okay for very specific use cases if you measure which experts are needed for your use case and carefully choose to delete the least used ones, but it doesn't really provide any general insight into how a higher sparsity model would behave if trained that way from scratch.

zozbot234 76 days ago

Large MoE models are too heavily bottlenecked on typical discrete GPUs. You end up pushing just a few common/non-shared layers to GPU and running the MoE part on CPU, because the bandwidth of PCIe transfers to a discrete GPU is a killer bottleneck. Platforms with reasonable amounts of unified memory are more balanced despite the lower VRAM bandwidth, and can more easily run even larger models by streaming inactive weights from SSD (though this quickly becomes overkill as you get increasingly bottlenecked by storage bandwidth: you'd be better off then with a plain HEDT accessing lots of fast storage in parallel via abundant PCIe lanes).

girvo 76 days ago

The value prop for the Nvidia one is simple: playing with CUDA with wide enough RAM at okay enough speeds, then running your actual workload on a server somewhere running the same (not really, lol Blackwell does not mean Blackwell…) architecture.

They’re fine tuning and teaching boxes, not inference boxes. IMO anyway, that’s what mine is for.

NitpickLawyer 76 days ago

Jeff Dean apparently didn't get the message that you weren't releasing the 124B MoE :D

Was it too good or not good enough? (blink twice if you can't answer lol)

coder68 76 days ago

120B would be great to have if you have it stashed away somewhere. GPT-OSS-120B still stands as one of the best (and fastest) open-weights models out there. A direct competitor in the same size range would be awesome. The closest recent release was Qwen3.5-122B-A10B.

kcb 76 days ago

Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...

evilduck 76 days ago

In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again like 50 tokens/sec vs 15 tokens/sec

That really bogs down agentic tooling. Something needs to be categorically better to justify halving output speed, not just playing in the margins.
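A rough sketch of what those speeds mean for one agent turn, using the approximate Strix Halo figures quoted above (about 600 vs 260 tokens/s prompt processing, 50 vs 15 tokens/s generation). The 20,000-token prompt and 500-token reply are made-up illustrative sizes, and the model ignores prompt caching entirely.

```python
def turn_seconds(prompt_tokens: int, pp_tok_s: float, output_tokens: int, tg_tok_s: float) -> float:
    """Time for one request: prompt processing followed by token generation."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


PROMPT_TOKENS = 20_000  # hypothetical agentic context size
OUTPUT_TOKENS = 500     # hypothetical reply length

# (prompt processing tok/s, token generation tok/s), rounded from the figures above.
speeds = {
    "gpt-oss-120b": (600, 50),
    "nemotron-3-super": (260, 15),
}

for name, (pp, tg) in speeds.items():
    secs = turn_seconds(PROMPT_TOKENS, pp, OUTPUT_TOKENS, tg)
    print(f"{name}: ~{secs:.0f} s per turn")
```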

mratsim 76 days ago

In my case with vLLM on dual RTX Pro 6000

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the prefill speed but it certainly was below 10k

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super using NVFP4 and speculative decoding via MTP 5 tokens at a time as mentioned in Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...

coder68 70 days ago

Hmm you might be able to tweak the settings further. Under llama.cpp on one RTX 6000 Pro I get ~215 tok/s generation speed. The key for me was setting min_p greater than 0. My settings:

```
#!/bin/bash

llama-server \
  -hf ggml-org/gpt-oss-120b-GGUF \
  -c 0 \
  -np 1 \
  --jinja \
  --no-mmap \
  --temp 1.0 \
  --top-p 1.0 \
  --min-p 0.001 \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  --host 0.0.0.0
```

coder68 76 days ago

I gave it a whirl but was unenthused. I'll try it again, but so far have not really enjoyed any of the nvidia models, though they are best in class for execution speed.

markab21 76 days ago

I'll pipe in here as someone working on an agentic harness project using mastra as the harness.

Nemotron3-super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family, but this thing has an ability to hold attention through complicated (often noisy) agentic environments, and I'm sometimes finding myself checking that I'm not on a frontier model.

I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.

The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss with OpenRouter. But running this model on a B6000 from Vast with a native NVFP4 model weight from Nvidia, it's really good (2500 peak tokens/sec on that setup with batching, about 100/s for a single request, 250k context). :)

I can run on a single B6000 up to about 120k context reliably but really this thing SCREAMS on a dual-b6000. (I'm close to just ordering a couple for myself it's working so well).

Good luck .. (Sometimes I feel like I'm the crazy guy in the woods loving this model so much, I'm not sure why more people aren't jumping on it..)

girvo 76 days ago

> I'm not sure why more people aren't jumping on it

Simple: most of the people you’re talking to aren’t setting these things up. They’re running off the shelf software and setups and calling it a day. They’re not working with custom harnesses or even tweaking temperature or templates, most of them.

pertymcpert 75 days ago

I’d be very interested in trying it if you could spare the time to write up how to tune it well. If not thanks for the input anyway.

WarmWash 76 days ago

Mainline consumer cards are 16GB, so everyone wants models they can run on their $400 GPU.

NekkoDroid 76 days ago

Yea, I've been waiting a while for a model that is ~12-13GB so there is still a bit of extra headroom for all the different things running on the system that for some reason eat VRAM.

vparseval 76 days ago

I found that you can run models locally pretty well that exceed your VRAM by a bit. At least ollama will hand excess off to your system RAM. Maybe performance suffers but I've never actually seen it crap out and I can wait a few minutes for a response.

vessenes 76 days ago

I'll pipe in - a series of Mac optimized MOEs which can stream experts just in time would be really amazing. And popular; I'm guessing in the next year we'll be able to run a very able openclaw with a stack like that. You'll get a lot of installs there. If I were a PM at Gemma, I'd release a stack for each Mac mini memory size.

zozbot234 76 days ago

Expert streaming is something that has to be implemented by the inference engine/library, the model architecture itself has very little to do with it. It's a great idea (for local inference; it uses too much power at scale), but making it work really well is actually not that easy.

(I've mentioned this before but AIUI it would require some new feature definitions in GGUF, to allow for coalescing model data about any one expert-layer into a single extent, so that it can be accessed in bulk. That's what seems to make the new Flash-MoE work so well.)

vessenes 76 days ago

I’ve been doing some low-key testing on smaller models, and it looks to me like it’s possible to train an MOE model with characteristics that are helpful for streaming… For instance, you could add a loss function to penalize expert swapping both in a single forward pass and across multiple forward passes. So I believe there is a place for thinking about this on the model training side.
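A minimal sketch of what such a penalty could look like, in plain Python for readability. This is an illustrative assumption, not the method actually tested: it only covers swaps between consecutive tokens within each layer, it treats each token's expert choice as an independent draw from its router distribution, and the names `swap_penalty`, `total_loss`, and the `weight` value are all hypothetical. A real training run would express the same quantity in an autograd framework.

```python
from typing import List

Probs = List[float]  # router probabilities over experts for one token


def swap_penalty(layer_probs: List[Probs]) -> float:
    """Mean expected swap rate between consecutive tokens in one layer.

    For tokens t and t+1, sum_e p_t[e] * p_{t+1}[e] is the chance that both
    pick the same expert when sampled independently; one minus that is the
    expected swap rate for the pair.
    """
    if len(layer_probs) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(layer_probs, layer_probs[1:]):
        overlap = sum(p * q for p, q in zip(prev, cur))
        total += 1.0 - overlap
    return total / (len(layer_probs) - 1)


def total_loss(task_loss: float, routing: List[List[Probs]], weight: float = 0.01) -> float:
    """Task loss plus the swap penalty averaged over layers.

    routing[layer][token] holds that token's router probabilities.
    """
    penalty = sum(swap_penalty(layer) for layer in routing) / len(routing)
    return task_loss + weight * penalty


# Two tokens, two experts, one layer: the second token switches expert.
routing = [[[0.9, 0.1], [0.2, 0.8]]]
print(total_loss(2.0, routing))
```

The penalty is 0 when consecutive tokens route identically with full confidence and 1 when they always pick different experts, so `weight` trades routing stability against the task loss.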

zozbot234 76 days ago

Penalizing expert swaps doesn't seem like it would help much, because experts vary by layer and are picked layer-wise. There's no guarantee that expert X in layer Y that was used for the previous token will still be available for this token's load from layer Y. The optimum would vary depending on how much memory you have at any given moment, and such. It's not obviously worth optimizing for.

vessenes 76 days ago

Right. You need to predict a set of experts through the entire forward pass. Think of a vertical strip.

UncleOxidant 76 days ago

Something in the 60B to 80B range would still be approachable for most people running local models and also could give improved results over 31B.

Also, as I understand it the 26B is the MOE and the 31B is dense - why is the larger one dense and the smaller one MOE?

tjwebbnorfolk 75 days ago

All of gemma's main competitors have larger models in the 80-240b range that take advantage of larger VRAM GPUs and dual-GPU setups.

Personally I have 2x RTX 6000 PROs and right now am running the 235b-parameter Qwen model with very good results. I also occasionally use gpt-oss:120b. I would like to see a gemma model in the same range.

Also many people are running these on Mac Minis now with 128GB+ of unified RAM.

Aiming for the "runs on a single H100" tagline doesn't make a lot of sense to me, because most people do not have H100s anyway.

__mharrison__ 76 days ago

My sweet spot is something that runs on less than 128gb.

(I have a DGX Spark, and MBP w/ 128gb).

jimbob45 76 days ago

> how good they need to be to make all of you super excited to use them

Isn't that more dictated by the competition you're facing from Llama and Qwen?

canyon289 76 days ago

This is going to sound like a corp answer but I mean this genuinely as an individual engineer. Google is a leader in its field and that means we get to chart our own path and do what is best for research and for users.

I personally strive to build software and models that provide the best and most usable experience for lots of people. I did this before I joined Google with open source, and my writing on "old school" generative models, and I'm lucky that I get to do this at Google in the current LLM era.