|
|
|
|
|
by superkuh
830 days ago
|
|
Depends on how many experts are active in any given pass. If it's a 10 expert mix of 33B experts (grok-0 is 33B, grok-1 is ~314B which is ~10x) and only runs two of them (like Mixtral's 2/8) then it'd have about the same inference requirements as a 70B model (2*33=66B). So if this was quantized using ~4 bits per parameter you'd need ~40GB of vram. So you could spread it across 2x 3090 24GB using llama.cpp. |
|
> So if this was quantized using ~4 bits per parameter you’d need ~40GB of vram.
No, Mixtral 8x7B (which is a total of 45 billion parameters, because there is a shared portion of the 7B, so its not 56 billion) at 4-bit quantization takes ~29GB [0]. A 314B model is ~7 times as large; with a similar architecture its not going to take only another 1/3 as much RAM.
[0] https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF