	by orost 1121 days ago
	You can just barely fit a 33B GPTQ model in 24GB of VRAM. It has to be quantized to 4 bits and you won't get the maximum context size, but it will be quite fast. Alternatively, you can split the model across RAM and VRAM using the GGML format with llama.cpp (or a derivative), which will easily fit 65B models even at 5 or 8 bits, but at much lower speed.
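
A rough back-of-the-envelope sketch of the arithmetic behind those numbers. It assumes the nominal parameter counts (33B, 65B) and counts only the quantized weights; it ignores per-group quantization metadata, the KV cache, and activations, so real usage will be somewhat higher. The function name `weight_gb` is just for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no extra storage for scales or zero points.
    """
    return params_billion * bits_per_weight / 8


# The configurations mentioned in the comment above.
configs = [
    ("33B @ 4-bit", 33, 4),
    ("65B @ 5-bit", 65, 5),
    ("65B @ 8-bit", 65, 8),
]

for name, params_billion, bits in configs:
    print(f"{name}: ~{weight_gb(params_billion, bits):.1f} GB of weights")
```

Under these assumptions a 33B model at 4 bits needs about 16.5 GB for weights alone, which leaves only a few GB of a 24GB card for context. That is consistent with "just barely" and with giving up the maximum context size. A 65B model at 5 or 8 bits comes out around 41 GB or 65 GB, which is why it has to spill into system RAM.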