by nutrientharvest 730 days ago
Ollama can already run Llama-3 70B with a 4GB GPU, or with no GPU at all; it'll just be slow.

Considering this says it's "not designed for real-time interactive scenarios", it's probably also really slow.


so how much GPU RAM does it need to get the 70B going fast (ish)?
A good rule of thumb is that models can be quantized to 6 to 8 bits per weight without significantly degrading quality. That makes the math convenient: at 8 bits each weight takes one byte, so 70B parameters come to about 70GB, plus some overhead for the attention (KV) cache of ongoing requests. This overhead depends on workload and context lengths, but you should expect about 30% more. So, roughly 90 to 100GB for a server under load.
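
If you want to plug in your own numbers, here's a rough sketch of that arithmetic in Python. The function name and the 30% default are just illustrative (taken from the rule of thumb above, not from any library), and real overhead will vary with batch size and context length.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float,
                     overhead_fraction: float = 0.30) -> float:
    """Back-of-the-envelope GPU memory estimate in (decimal) GB.

    params_billion    -- model size in billions of parameters (e.g. 70)
    bits_per_weight   -- quantization level (e.g. 6 or 8)
    overhead_fraction -- assumed extra memory for the attention cache of
                         ongoing requests; 0.30 is the rough guess from
                         the comment above, not a measured value
    """
    # 1 billion parameters at 8 bits per weight is 1 GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


if __name__ == "__main__":
    for bits in (6, 8):
        total = estimate_vram_gb(70, bits)
        print(f"70B at {bits} bits/weight: about {total:.0f} GB")
```

At 8 bits this lands a bit above 90GB, which is where the "around 100GB" figure comes from once you leave some headroom; dropping to 6 bits brings it down to roughly 70GB.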