| HN Mirror



	by burke 930 days ago
	Napkin math: 7B params x (4/8) bytes per param x 8 experts is 28GB. q4 formats use a little more than 4 bits per param, there's extra overhead for context, and the network that selects experts is probably a bit more on top of that. It would probably fit in 32GB at 4-bit, but it probably won't run with sensible quantization/perf on a 24GB 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable.
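
	The arithmetic above can be sketched in a few lines of Python. This is only a rough estimate under the comment's own assumption that the model is 8 independent experts of 7B params each. The function name and the 4.5 bits/param figure are illustrative assumptions (a stand-in for "a little more than 4 bits"), not measured values, and context and router overhead are left out.

```python
def moe_weight_gb(params_per_expert_b: float, bits_per_param: float, num_experts: int) -> float:
    """Rough weight footprint in GB (1 GB = 1e9 bytes).

    params_per_expert_b is in billions of parameters, so
    billions of params x bytes per param gives GB directly.
    """
    bytes_per_param = bits_per_param / 8
    return params_per_expert_b * bytes_per_param * num_experts


# The comment's napkin math: 7 x (4/8) x 8
naive_gb = moe_weight_gb(7, 4, 8)

# ASSUMPTION: 4.5 effective bits/param as an illustrative q4 overhead
padded_gb = moe_weight_gb(7, 4.5, 8)

GPU_VRAM_GB = 24  # RTX 3090 / 4090

print(f"exactly 4 bits:   {naive_gb} GB")
print(f"assumed 4.5 bits: {padded_gb} GB")
print(f"fits in {GPU_VRAM_GB} GB of VRAM: {padded_gb <= GPU_VRAM_GB}")
```

	Either way the weights alone land above 24GB, which is why offloading comes up.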