


	by moffkalast 843 days ago
	True, it's the only way I can, for example, run Mixtral on an 8 GB GPU, but main memory will always have higher latency, so some tradeoff tends to be worth it. Also, parts like the prompt batch buffer and most of the context generally have to be in VRAM if you want to use cuBLAS. With OpenBLAS that's maybe less of a problem, but it is slower.
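
To make the tradeoff above concrete, here is a rough back-of-the-envelope sketch of how many layers could be kept in VRAM once space is set aside for the batch buffer and context. The function name, parameters, and all numbers are hypothetical and chosen only for illustration; they are not measurements of Mixtral or of any particular backend.

```python
def layers_that_fit(vram_gb, reserved_gb, layer_gb, total_layers):
    """Estimate how many model layers fit in VRAM.

    vram_gb      -- total GPU memory (assumed value)
    reserved_gb  -- memory set aside for the batch buffer and context (assumed value)
    layer_gb     -- size of one layer's weights (assumed value)
    total_layers -- number of layers in the model (assumed value)
    """
    usable = vram_gb - reserved_gb
    if usable <= 0 or layer_gb <= 0:
        return 0
    # Whole layers only; never report more layers than the model has.
    return min(total_layers, int(usable // layer_gb))


# Hypothetical numbers, for illustration only.
on_gpu = layers_that_fit(vram_gb=8.0, reserved_gb=2.0, layer_gb=0.75, total_layers=32)
print(f"{on_gpu} layers on the GPU, {32 - on_gpu} left in main memory")
```

Any layer that doesn't fit stays in main memory and pays the higher latency mentioned above, so the larger the reserved share, the fewer layers get the fast path.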