| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by M4v3R 919 days ago
	You need to pick the correct model size and quantization for the amount of GPU RAM you have. For any given model don’t download the default file, instead go to Tags section on Ollama’s page and pick a quantization whose size in GB is at most 2/3rd of your available RAM, and it should work. For example in your case Mistral-7B q4_0 and even q8_0 should work perfectly.

1 comments

swyx 919 days ago

whats the intuition for 2/3 of RAM?

link

M4v3R 919 days ago

Because there’s always some overhead during inference plus you don’t want to fill all your available RAM because you risk swapping to disk which will make everything slow to a crawl.

link

swyx 918 days ago

so why is the overhead a 1/3 ratio instead of a constant amount? just testing the scaling assumption

link

avereveard 919 days ago

you need some leftover for holding the context

link