| HN Mirror



	by nutrientharvest 730 days ago
	Ollama can already run Llama-3 70B with a 4GB GPU, or with no GPU at all; it'll just be slow. Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.

1 comments

cpill 730 days ago

So how much GPU RAM does it need to get the 70B going fast(ish)?


AaronFriel 730 days ago

A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. This is convenient for the math: at 8 bits (one byte) per weight, a 70B-parameter model needs about 70GB for the weights, plus some overhead for the attention matrices (the state kept for ongoing requests). This overhead depends on workload and context lengths, but you should expect about 30% more, which puts the total a bit over 90GB. So, around 100GB for a server under load.
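
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate under stated assumptions: the function names `weights_gb` and `estimate_vram_gb` are made up for illustration, the 30% overhead is the rough figure quoted above rather than a measured value, and 1GB is taken as 10^9 bytes.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.30,  # assumed rule-of-thumb overhead, workload dependent
) -> float:
    """Weights plus a flat fractional overhead for per-request attention state."""
    return weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


if __name__ == "__main__":
    for bits in (6, 8):
        total = estimate_vram_gb(70, bits)
        print(f"70B at {bits} bits/weight: roughly {total:.0f} GB with overhead")
```

The overhead is kept as a single adjustable fraction because it varies with context length and the number of concurrent requests; for a real deployment it would need to be measured rather than assumed.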
