


	by tarruda 930 days ago
	Theoretically, it could fit on a single 24GB GPU if quantized to 4 bits. ExLlamaV2 has an even more efficient quantization algorithm and was able to fit 70B models on a 24GB GPU, but only with 2048 tokens of context.
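A rough back-of-the-envelope sketch of the arithmetic behind this, counting weight storage only. The helper names are hypothetical, "GB" is taken as 10^9 bytes, and activations, KV cache and framework overhead are ignored, so real headroom is smaller than these numbers suggest:

```python
# Rough weight-memory estimate for a quantized model.
# Assumptions (not from the comment): decimal GB (1e9 bytes), weights only,
# no KV cache, activations, or runtime overhead.

GB = 1e9


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB at a given average bit width."""
    return n_params * bits_per_weight / 8 / GB


def max_bits_per_weight(n_params: float, budget_gb: float) -> float:
    """Largest average bits per weight that still fits in the budget."""
    return budget_gb * GB * 8 / n_params


if __name__ == "__main__":
    # A 70B model at a flat 4 bits per weight needs about 35 GB for weights alone.
    print(f"70B @ 4.0 bpw: {weight_gb(70e9, 4.0):.1f} GB")
    # To fit 70B in 24 GB, the average must drop below roughly 2.74 bits per weight.
    print(f"70B in 24 GB: <= {max_bits_per_weight(70e9, 24):.2f} bpw")
```

By this estimate a flat 4-bit 70B model would not fit in 24GB, so fitting one implies an average well under 3 bits per weight. Whatever is left after the weights has to hold the KV cache, which is consistent with the context being limited to 2048 tokens.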