| HN Mirror


	by orost 912 days ago
	You can partially offload with some backends (e.g. llama.cpp and its derivatives), but the speed gains don't show up until the model is mostly offloaded. I have 8GB of VRAM, and that's not enough to get any boost on Mixtral in Q8. 16GB might do better, or it might not. The speed is quite good even on CPU only, though: I get 3.5 tokens per second with 6 cores and DDR5-6000. For comparison, llama2-70B in Q4 runs at less than 1 t/s on the same hardware. And, subjectively, Mixtral performs better.
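
	A rough back-of-the-envelope sketch of why that gap is plausible, assuming CPU inference is limited mainly by memory bandwidth (every active weight is read once per token). The figures below are assumptions, not measurements from this setup: dual-channel DDR5-6000, about 12.9B active parameters per token for Mixtral 8x7B, and roughly 8.5 and 4.5 bits per weight for Q8 and Q4 style quantization. The function name is made up for illustration.

```python
def bandwidth_bound_tps(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


# Assumption: dual-channel DDR5-6000 = 6000 MT/s * 8 bytes * 2 channels (theoretical peak).
DDR5_6000_DUAL = 6000e6 * 8 * 2

# Assumption: Mixtral 8x7B touches ~12.9B parameters per token (2 of 8 experts active).
mixtral_q8 = bandwidth_bound_tps(12.9e9, 8.5, DDR5_6000_DUAL)

# Assumption: a dense 70B model reads all ~70B parameters for every token.
llama70b_q4 = bandwidth_bound_tps(70e9, 4.5, DDR5_6000_DUAL)

print(f"Mixtral Q8 bound:    {mixtral_q8:.1f} t/s")
print(f"llama2-70B Q4 bound: {llama70b_q4:.1f} t/s")
print(f"ratio:               {mixtral_q8 / llama70b_q4:.1f}x")
```

	These are theoretical ceilings, so real numbers will land well below them, but the ordering matches what I see: the sparse model reads far fewer bytes per token even at a higher-precision quant.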