by burke
930 days ago
Napkin math: 7B params x (4/8) bytes per param x 8 experts is 28GB. q4 uses a little more than just 4 bits per param, there’s extra overhead for context, and the FFN to select experts is probably more on top of that. It would probably fit in 32GB at 4-bit, but it probably won’t run with sensible quantization/perf on a 3090/4090 (24GB each) without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.
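
For anyone who wants to poke at the numbers, here is the same napkin math as a rough Python sketch. It assumes, as above, 8 fully separate experts of 7B params each. The 4.5 bits/param figure for "a little more than 4 bits" is a guess for illustration, not a measured number, and the function names are made up. Context and expert-selection overhead are left out, so treat the results as lower bounds.

```python
def weight_footprint_gb(params_per_expert_b: float, n_experts: int, bits_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes).

    params_per_expert_b is in billions of parameters, so
    billions of params * bytes per param gives GB directly.
    """
    bytes_per_param = bits_per_param / 8
    return params_per_expert_b * bytes_per_param * n_experts


def fits(total_gb: float, budget_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return total_gb <= budget_gb


# Ideal 4-bit: exactly the 7 x (4/8) x 8 figure.
naive_gb = weight_footprint_gb(7, 8, 4)

# Assumed effective bits per param once quantization metadata is included.
# 4.5 is a hypothetical value chosen only to illustrate the trend.
padded_gb = weight_footprint_gb(7, 8, 4.5)

GPU_VRAM_GB = 24  # RTX 3090 / 4090

print(f"ideal 4-bit:   {naive_gb:.1f} GB, fits in 24GB: {fits(naive_gb, GPU_VRAM_GB)}")
print(f"assumed 4.5b:  {padded_gb:.1f} GB, fits in 32GB: {fits(padded_gb, 32)}")
```

Even the ideal 4-bit figure is over 24GB, which is why a single 3090/4090 would need offloading or a more aggressive quantization. If some layers turn out to be shared across experts rather than duplicated, the real total would be lower than this estimate.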