by EnPissant 276 days ago
What you are describing would be uselessly slow and nobody does that.

I don't load all the MoE layers onto my GPU, and I have only about a 15% reduction in token generation speed while maintaining a model 2-3 times larger than VRAM alone.
The slowdown is far more than 15% for token generation. Token generation is mostly bottlenecked by memory bandwidth. Dual-channel DDR5-6000 has 96GB/s and an RTX 5090 has 1.8TB/s. See my other comment, where I show a 5x slowdown in token generation from moving just the experts to the CPU.
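For anyone who wants to check the bandwidth argument, here is a rough back-of-the-envelope sketch. The 96GB/s and 1.8TB/s figures are the ones quoted above; the 8-byte channel width is the standard DDR figure, and `GB_READ_PER_TOKEN` is a made-up placeholder, not a measurement of any particular model. It gives a ceiling only; real throughput will be lower.

```python
def ddr_peak_gb_per_s(mt_per_s: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (transfers/s * bytes * channels)."""
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def decode_ceiling_tok_per_s(bandwidth_gb_per_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if every token must stream this many GB of weights."""
    return bandwidth_gb_per_s / gb_read_per_token


ram_bw = ddr_peak_gb_per_s(6000, channels=2)  # dual-channel DDR5-6000
gpu_bw = 1800.0                               # RTX 5090 figure quoted in the thread

# Hypothetical amount of weight data touched per generated token (assumption).
GB_READ_PER_TOKEN = 2.0

print(f"RAM peak: {ram_bw:.0f} GB/s, GPU peak: {gpu_bw:.0f} GB/s")
print(f"bandwidth ratio: {gpu_bw / ram_bw:.2f}x")
print(f"RAM decode ceiling: {decode_ceiling_tok_per_s(ram_bw, GB_READ_PER_TOKEN):.0f} tok/s")
print(f"GPU decode ceiling: {decode_ceiling_tok_per_s(gpu_bw, GB_READ_PER_TOKEN):.0f} tok/s")
```

The ratio does not depend on the placeholder: whatever the per-token weight traffic is, the part served from system RAM has a ceiling roughly 19x lower than the part served from VRAM.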
I suggest figuring out what your configuration problem is.

Which llama.cpp flags are you using? I am absolutely not seeing the same bug you are.

It's not a bug. It's the reality of token generation. It's bottlenecked by memory bandwidth.

Please publish your own benchmarks proving me wrong.

I cannot reproduce your bug on AMD. I'm going to have to conclude this is a vendor issue.
I do it with gpt-oss-120B on 24 GB VRAM.
You don't. You run some of the layers on the CPU.
You're right that I was confused about that.

LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.

FWIW, that's an 80GB model and you also need KV cache. You'd need 96GBish to run on the GPU.
Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.
It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then send gigabytes of weights PER TOKEN over PCIe to do some trivial computation on the GPU is crazy.

What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.
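
A rough sketch of that comparison, with loudly labeled assumptions: `ACTIVE_EXPERT_GB` is a hypothetical amount of expert weights needed per token, and `PCIE_GB_PER_S` is roughly the theoretical one-direction peak of a PCIe 5.0 x16 link (real transfers are slower). Only the 96GB/s RAM figure comes from this thread.

```python
def seconds_per_token(gb_moved: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream gb_moved gigabytes at the given bandwidth."""
    return gb_moved / bandwidth_gb_per_s


ACTIVE_EXPERT_GB = 2.0  # hypothetical expert weights touched per token (assumption)
RAM_GB_PER_S = 96.0     # dual-channel DDR5-6000, as quoted in the thread
PCIE_GB_PER_S = 64.0    # assumption: approximate PCIe 5.0 x16 peak, one direction

# Option A: the CPU reads the expert weights from system RAM and computes in place.
cpu_time = seconds_per_token(ACTIVE_EXPERT_GB, RAM_GB_PER_S)

# Option B: page the same weights to the GPU for every token. The copy is limited
# by the slower of the RAM read and the PCIe link, before any GPU work starts.
paging_time = seconds_per_token(ACTIVE_EXPERT_GB, min(RAM_GB_PER_S, PCIE_GB_PER_S))

print(f"CPU from RAM:     {cpu_time * 1000:.1f} ms/token")
print(f"Paging over PCIe: {paging_time * 1000:.1f} ms/token (transfer only)")
```

Under these assumptions the transfer alone takes at least as long as letting the CPU do the work, because the weights have to be read out of system RAM either way.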

I'm guessing LM Studio gracefully falls back to running _something_ on the CPU. Hopefully you are running only MoE on the CPU. I've only ever used llama.cpp.

^ Er, misspoke: each expert is at most 0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).
I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.
For contrast, I get the following for an RTX 5090 and 30B Qwen3 Coder quantized to ~4 bits:

- Prompt processing 65k tokens: 4818 tokens/s

- Token generation 8k tokens: 221 tokens/s

If I offload just the experts to run on the CPU I get:

- Prompt processing 65k tokens: 3039 tokens/s

- Token generation 8k tokens: 42.85 tokens/s

As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.
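
The slowdown factors can be checked directly from the four numbers above. This is just arithmetic on the figures in this comment; the dictionary keys are labels chosen for illustration.

```python
# Benchmark figures quoted above (tokens/s).
all_on_gpu = {"prompt_processing": 4818.0, "token_generation": 221.0}
experts_on_cpu = {"prompt_processing": 3039.0, "token_generation": 42.85}

# Slowdown factor = all-on-GPU throughput / experts-on-CPU throughput.
slowdown = {stage: all_on_gpu[stage] / experts_on_cpu[stage] for stage in all_on_gpu}

for stage, factor in slowdown.items():
    print(f"{stage}: {factor:.2f}x slower with experts on the CPU")
```

Prompt processing takes a much smaller hit than token generation, which fits decode being the bandwidth-bound stage.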

AFAIK many people on /r/localLlama do pretty much that.
llama.cpp has built-in support for doing this, and it works quite well. Lots of people running LLMs on limited local hardware use it.
llama.cpp has support for running some or all of the layers on the CPU. It does not swap them into the GPU as needed.
It's neither hypothetical nor rare.
You are confusing this with running layers on the CPU.