| HN Mirror



	by hmottestad 806 days ago
	Since it’s a MoE model, it will only need to load a few of the 8 sub-models into VRAM in order to answer a query. So it may look large, but I think a quantized model will easily fit on a Mac with 64GB of memory, and with a few fewer bits it might even fit into 32GB. I think it might be the end for 24GB 4090 cards though :(

4 comments

dragonwriter 805 days ago

MoE models don’t, in practice, selectively load experts on activation (and if a runtime could be designed to do that, it would make them perform worse: the experts activated may differ from token to token, so you’d be churning constantly, swapping portions of the model into and out of VRAM). But they do less computation per token for their size than monolithic models, so you can often get tolerable performance on CPU, or split between GPU and CPU at a ratio that would work poorly with a similarly sized monolithic model.

But, still, it's going to need 262GB for weights without quantization, plus a variable amount based on context, and 66GB+ at 4-bit quantization.
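The 4-bit figure follows from the unquantized one by simple scaling. Here is a rough sketch of that arithmetic, assuming "without quantization" means 16-bit weights and that weight memory scales linearly with bits per weight. The helper name `scale_weight_memory` is made up for illustration. Real quantized formats store extra metadata alongside the weights, and context memory comes on top, which is why the comment says "66GB+".

```python
def scale_weight_memory(size_gb: float, from_bits: float, to_bits: float) -> float:
    """Scale a weights-only footprint linearly with bits per weight.

    Ignores quantization metadata and context (KV cache) memory.
    """
    return size_gb * to_bits / from_bits


# Figure quoted in the comment above; assumed here to refer to 16-bit weights.
UNQUANTIZED_GB = 262.0
UNQUANTIZED_BITS = 16

for bits in (8, 5, 4, 2):
    estimate = scale_weight_memory(UNQUANTIZED_GB, UNQUANTIZED_BITS, bits)
    print(f"{bits}-bit: ~{estimate:.1f} GB for weights alone")
```

The same scaling puts a 5-bit quant in the low 80s of GB before overhead, which is in the same ballpark as the other estimates in this thread.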


brandall10 806 days ago

Unless something has changed, it needs to load the full 8 models at the same time. During inference it performs like a 2 x base model.

Mixtral 8x7B @ 5-bit takes up over 30GB on my M3 Max. That's over 90GB for this at the same quantization. Realistically you probably need a 128GB machine to run this with good results.


fzzzy 806 days ago

A 4-bit quant of the new one would still be about 70GB, so yeah. Gonna need a lot more RAM.


Kubuxu 805 days ago

The 8x is misleading; there are 8 sets of weights (experts) in each layer, and the choice among them is made per token. If it is similar to the previous MoE Mistral models, then two experts get activated per token per layer. This reduces the amount of compute and memory bandwidth you need to perform inference, but it doesn't reduce the amount of memory you need, as you cannot load the experts into GPU memory on demand without a performance impact.
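To see why on-demand loading doesn't help, here is a toy sketch of top-2 routing for a single layer. It assumes 8 experts with 2 active per token, as described above, and uses random scores as a stand-in for a learned router, so the exact counts are illustrative only. The function names `route` and `experts_touched` are hypothetical.

```python
import random

NUM_EXPERTS = 8  # experts per layer, per the comment above
TOP_K = 2        # experts activated per token per layer (assumed, as in earlier Mixtral models)


def route(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def experts_touched(num_tokens, seed=0):
    """Collect every expert used by one layer while processing num_tokens tokens.

    Router scores are random here; a real router is learned and may be
    less evenly spread across experts.
    """
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(NUM_EXPERTS)]
        touched.update(route(scores))
    return touched


for n in (1, 4, 16, 64):
    used = len(experts_touched(n))
    print(f"{n} tokens -> {used} of {NUM_EXPERTS} experts used in this layer")
```

Each token only computes through 2 experts, but over even a short sequence the set of experts touched tends to grow toward all 8, so all of them need to stay resident.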


mark_l_watson 806 days ago

I think you are an optimist here. I can barely run mixtral-8x7B on my M2 Pro 32GB Mac, but I am grateful to be able to run it at all.


JanisErdmanis 806 days ago

Which quantization level are you using?


mark_l_watson 805 days ago

Q2, so not so great. I usually run other models. I would be embarrassed to tell you how long my “ollama list” is.
