	by yekanchi 322 days ago
	I mean 4-bit quantized. I can roughly calculate VRAM for dense models from the model size, but I don't know how to do it for MoE models.

2 comments

EnPissant 322 days ago

MoE models need just as much VRAM as dense models because every token may use a different set of experts. They just run faster.

regularfry 322 days ago

This isn't quite right: it'll run with the full model loaded into RAM, swapping in the experts as it needs them. It has turned out in the past that experts can be stable across more than one token, so you're not swapping as much as you'd think. I don't know if that's been confirmed to still be true on recent MoEs, but I wouldn't be surprised.

mcrutcher 322 days ago

Also, though nobody has put the work in yet, the GH200 and GB200 (the NVIDIA "superchips") support exposing their full LPDDR5X and HBM3 as UVM (unified virtual memory), with much more memory bandwidth between LPDDR5X and HBM3 than a typical "instance" using PCIe. UVM can handle "movement" in the background and would be absolutely killer for these MoE architectures, but none of the popular inference engines actually allocate memory correctly for these architectures (with cudaMallocManaged()), allow UVM (CUDA) to actually handle movement of data for them (automatic page migration and dynamic data movement), or are architected to avoid pitfalls in this environment (being aware of the implications of CUDA graphs when using UVM).

It's really not that much code, though, and all the actual capabilities are there as of about mid this year. I think someone will make this work and it will be a huge efficiency win for the right model/workflow combinations (effectively, being able to run 1T parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).

EnPissant 322 days ago

What you are describing would be uselessly slow and nobody does that.

DiabloD3 322 days ago

I don't load all the MoE layers onto my GPU, and I have only about a 15% reduction in token generation speed while maintaining a model 2-3 times larger than VRAM alone.

EnPissant 322 days ago

The slowdown is far more than 15% for token generation. Token generation is mostly bottlenecked by memory bandwidth. Dual-channel DDR5-6000 has 96 GB/s and an RTX 5090 has 1.8 TB/s. See my other comment, where I show a 5x slowdown in token generation from moving just the experts to the CPU.
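
To make the bandwidth argument concrete, here is a minimal sketch that derives the 96 GB/s figure and turns bandwidth into a rough ceiling on tokens per second. The function names are hypothetical, and the "3B active parameters at 4 bits" figure is an illustrative assumption, not a number from this thread. It also assumes every active weight is read once per generated token and nothing else limits speed, so real throughput will be lower.

```python
def dram_bandwidth_bytes_per_s(mega_transfers_per_s: float,
                               channels: int,
                               bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth: transfers/s * channels * bytes per transfer.

    bus_bytes=8 assumes a 64-bit channel, the usual way dual-channel
    desktop memory is counted.
    """
    return mega_transfers_per_s * 1e6 * channels * bus_bytes


def max_tokens_per_s(bandwidth_bytes_per_s: float,
                     bytes_read_per_token: float) -> float:
    """Upper bound on generation speed if memory reads are the only cost."""
    return bandwidth_bytes_per_s / bytes_read_per_token


ddr5 = dram_bandwidth_bytes_per_s(6000, channels=2)  # dual-channel DDR5-6000
gpu = 1.8e12                                         # RTX 5090 figure quoted above

# ASSUMPTION: an MoE with ~3B active parameters per token, stored at 4 bits.
active_bytes = 3e9 * 4 / 8

print(f"DDR5 peak: {ddr5 / 1e9:.0f} GB/s")
print(f"GPU / DDR5 bandwidth ratio: {gpu / ddr5:.2f}x")
print(f"ceiling from system RAM: {max_tokens_per_s(ddr5, active_bytes):.0f} tok/s")
print(f"ceiling from VRAM: {max_tokens_per_s(gpu, active_bytes):.0f} tok/s")
```

The point of the sketch is only the ratio: whatever weights are read from system RAM each token are read roughly 19x slower than the same weights in VRAM.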

DiabloD3 321 days ago

I suggest figuring out what your configuration problem is.

Which llama.cpp flags are you using? I am absolutely not having the same bug you are.

furyofantares 322 days ago

I do it with gpt-oss-120B on 24 GB VRAM.

EnPissant 322 days ago

You don't. You run some of the layers on the CPU.

furyofantares 321 days ago

You're right that I was confused about that.

LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.

bigyabai 322 days ago

I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.

EnPissant 322 days ago

For contrast, I get the following with an RTX 5090 and Qwen3 Coder 30B quantized to ~4 bits:

- Prompt processing 65k tokens: 4818 tokens/s

- Token generation 8k tokens: 221 tokens/s

If I offload just the experts to run on the CPU I get:

- Prompt processing 65k tokens: 3039 tokens/s

- Token generation 8k tokens: 42.85 tokens/s

As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.
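
The slowdown factors follow directly from the four throughput numbers above. A small sketch of the arithmetic (the `slowdown` helper is a hypothetical name used only for illustration):

```python
def slowdown(baseline_tps: float, offloaded_tps: float) -> float:
    """How many times slower the offloaded run is than the all-GPU run."""
    return baseline_tps / offloaded_tps


# Throughput figures quoted in the comment above (tokens/s).
generation = slowdown(221, 42.85)
prompt = slowdown(4818, 3039)

print(f"token generation: {generation:.2f}x slower")
print(f"prompt processing: {prompt:.2f}x slower")
```

Token generation takes the larger hit, which is what you would expect if it is the bandwidth-bound phase.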

littlestymaar 322 days ago

AFAIK many people on /r/LocalLLaMA do pretty much that.

zettabomb 322 days ago

llama.cpp has built-in support for doing this, and it works quite well. Lots of people running LLMs on limited local hardware use it.

EnPissant 322 days ago

llama.cpp has support for running some of or all of the layers on the CPU. It does not swap them into the GPU as needed.

regularfry 322 days ago

It's neither hypothetical nor rare.

EnPissant 322 days ago

You are confusing that with running layers on the CPU.

DiabloD3 322 days ago

Same calculation, basically. Any given ~30B model is going to be the same size and use the same VRAM (assuming you load it all into VRAM, which MoEs do not need to do).
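
A rough sketch of that calculation, assuming the weights dominate memory use and ignoring the KV cache, activations, and runtime overhead. The function name is hypothetical, and the 4.5 bits-per-weight case is an illustrative assumption for quantization metadata such as scales, not a figure from this thread.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights, in GB (1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Size depends on the TOTAL parameter count, so a ~30B MoE and a ~30B dense
# model come out the same at the same quantization. What differs for an MoE
# is how much of that total has to sit in VRAM rather than system RAM.
ideal_4bit = weight_memory_gb(30e9, 4.0)
with_overhead = weight_memory_gb(30e9, 4.5)  # ASSUMPTION: extra bits for scales

print(f"ideal 4-bit: {ideal_4bit:.1f} GB")
print(f"with assumed overhead: {with_overhead:.1f} GB")
```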
