| HN Mirror



	by jejones3141 830 days ago
	What kind of hardware would one need to use the model with reasonable performance?

4 comments

nirav72 830 days ago

I wouldn’t bother. Probably should wait until someone releases an optimized version.


superkuh 830 days ago

Depends on how many experts are active in any given pass. If it's a 10 expert mix of 33B experts (grok-0 is 33B, grok-1 is ~314B which is ~10x) and only runs two of them (like Mixtral's 2/8) then it'd have about the same inference requirements as a 70B model (2*33=66B).

So if this was quantized using ~4 bits per parameter you'd need ~40GB of vram. So you could spread it across 2x 3090 24GB using llama.cpp.


dragonwriter 830 days ago

MoE has the same “loading” RAM requirements as any other model with the same total parameters (not just for the fixed portion plus whatever experts are activated at any one time) because it has to load all the parameters. The additional needed because of context may be lower (not sure), but the big difference is that it has much better inference speed (and, as a result, can be tolerable with layers split between VRAM and system RAM where a similarly-sized non-MoE model would not.)

> So if this was quantized using ~4 bits per parameter you’d need ~40GB of vram.

No, Mixtral 8x7B (which is a total of 45 billion parameters, because there is a shared portion of the 7B, so it's not 56 billion) at 4-bit quantization takes ~29GB [0]. A 314B model is ~7 times as large; with a similar architecture it's not going to take only another 1/3 as much RAM.

[0] https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
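
The scaling argument above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes memory grows linearly with total parameter count at a fixed quantization, the function name is made up for illustration, and the 45B / 29GB / 314B figures are the ones quoted in this comment rather than independent measurements.

```python
def scale_memory_gb(ref_params_b: float, ref_mem_gb: float, target_params_b: float) -> float:
    """Linearly scale a reference memory footprint to another total parameter count.

    Assumption: similar architecture and the same quantization, so memory is
    roughly proportional to total (not active) parameters.
    """
    return ref_mem_gb * target_params_b / ref_params_b


# Figures quoted in the comment: Mixtral 8x7B is ~45B total, ~29GB at 4-bit.
mixtral_params_b = 45.0
mixtral_mem_gb = 29.0
grok_params_b = 314.0

ratio = grok_params_b / mixtral_params_b
estimate_gb = scale_memory_gb(mixtral_params_b, mixtral_mem_gb, grok_params_b)

print(f"size ratio: {ratio:.1f}x")
print(f"estimated 4-bit footprint: {estimate_gb:.0f} GB")
```

Under those assumptions the estimate lands at roughly 200GB, about five times the ~40GB figure being disputed.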


superkuh 830 days ago

You're absolutely right. I don't know what I was thinking.


zaptrem 830 days ago

I think the rule of thumb is 1GB of VRAM per 1 billion params quantized to FP8.


dragonwriter 830 days ago

Just to load the model without actually running it requires 1GB of whatever RAM it is loading and running in (could be VRAM, system RAM, or a combination, with different performance characteristics for each option) per billion parameters at 8-bit quantization. Though models often are usefully run at 4-5 bit quantization, which saves half (or nearly so) of that.

You also need additional RAM that increases as some function of context size (not sure what function, and ISTR there are big-O differences between architectures in how it varies) to actually do inference.
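
A minimal sketch of both pieces, for anyone who wants to plug in numbers. The weights part is just the rule of thumb above. The context part assumes a standard decoder-only transformer with a key/value cache, where the cache grows linearly with context length; other architectures can behave differently. The function names and the layer/head/context values are hypothetical placeholders, not a published model configuration.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory to hold the weights alone: 1 GB per billion parameters at 8 bits."""
    return params_billion * bits_per_param / 8.0


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache size for one sequence in a standard decoder-only transformer.

    Keys and values (2 tensors) are cached per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Weights for a 314B-parameter model at a few quantization levels.
for bits in (8, 5, 4):
    print(f"314B at {bits}-bit: {weights_gb(314, bits):.1f} GB")

# Hypothetical configuration, chosen only to show the shape of the formula.
cache = kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"KV cache for one 8192-token sequence: {cache:.2f} GB")
```

Doubling `context_tokens` doubles the cache term in this sketch, while the weights term stays fixed.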


trillic 830 days ago

8xH100s to keep it all in RAM from my uneducated view.
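
A quick sanity check of that guess, as a sketch only. It assumes the 80GB H100 variant and counts weights alone (no context or activation memory); the constant and function names are made up for illustration.

```python
H100_GB = 80     # assumption: the 80GB H100 variant
N_GPUS = 8
PARAMS_B = 314   # total parameters in billions, as quoted in the thread


def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only footprint: 1 GB per billion parameters at 8 bits."""
    return params_billion * bits_per_param / 8


total_vram = H100_GB * N_GPUS
for bits in (16, 8, 4):
    need = weights_gb(PARAMS_B, bits)
    print(f"{bits}-bit: {need:.0f} GB of weights, "
          f"{total_vram - need:.0f} GB left of {total_vram} GB")
```

At 16 bits the weights alone come to 628GB of the 640GB available, so there is very little headroom left for context unless the model is quantized.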
