| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by regularfry 606 days ago
	In addition to the other answers in this thread, there's a practical one: sometimes (ok, often) you want to run a model on a card that doesn't have enough VRAM for it. Quantisation is a way to squeeze it down so it fits. For instance I've got a 4090 that won't fit the original Llama3 70b at 16 bits per param, but it will give me usable token rates at 2 bits.