


	by zozbot234 95 days ago
	Most local-AI frameworks support loading the experts into system memory. But you do not gain much by running the expert part of decode on the GPU: decode is not compute-limited, so the GPU's extra compute does not help, and the CPU-GPU transfer adds overhead. It's best to use the GPU to speed up the shared part of the model.
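	A rough back-of-envelope sketch of that argument, in Python. Every figure here (bytes of expert weights touched per token, RAM bandwidth, PCIe bandwidth) is an illustrative assumption rather than a measurement, and the helper name is made up for this example. It only compares the time to stream the active expert weights once from system memory against the time to push the same bytes across the CPU-GPU link.

```python
def stream_time_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Milliseconds needed to move num_bytes once at the given bandwidth."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Assumed, illustrative numbers. Substitute values for your own hardware.
ACTIVE_EXPERT_BYTES = 4e9    # expert weights read per decoded token
RAM_BANDWIDTH_GB_S = 80.0    # CPU reading experts from system memory
PCIE_BANDWIDTH_GB_S = 25.0   # copying the same experts to the GPU

# Decode is bandwidth-bound, so per-token time is roughly bytes / bandwidth.
cpu_ms = stream_time_ms(ACTIVE_EXPERT_BYTES, RAM_BANDWIDTH_GB_S)
pcie_ms = stream_time_ms(ACTIVE_EXPERT_BYTES, PCIE_BANDWIDTH_GB_S)

print(f"experts computed on CPU from RAM: ~{cpu_ms:.0f} ms per token")
print(f"experts copied to GPU first:      ~{pcie_ms:.0f} ms per token, before any compute")
```

	Under these assumed numbers the copy alone takes longer than letting the CPU read the experts in place, which is why the GPU is better spent on the shared layers that stay resident in VRAM.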