Hacker News
by M4v3R 919 days ago
You need to pick the right model size and quantization for the amount of GPU RAM you have. For any given model, don’t download the default file; instead, go to the Tags section on Ollama’s page and pick a quantization whose size in GB is at most two-thirds of your available RAM, and it should work. For example, in your case Mistral-7B q4_0 and even q8_0 should work fine.
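A minimal sketch of the two-thirds rule of thumb, in case it helps to see it spelled out. The function names, the file sizes, and the 16 GB figure are illustrative assumptions, not numbers from this thread; check the actual sizes in the Tags section.

```python
def memory_budget_gb(available_gb, fraction=2 / 3):
    """Largest model file size (GB) the rule of thumb allows."""
    return available_gb * fraction


def quantizations_that_fit(candidates, available_gb, fraction=2 / 3):
    """Return the names of the quantizations whose file size is within budget.

    candidates maps a quantization tag to its file size in GB.
    """
    budget = memory_budget_gb(available_gb, fraction)
    return [name for name, size_gb in candidates.items() if size_gb <= budget]


# Hypothetical file sizes in GB for a 7B model (assumed, approximate).
sizes_gb = {"q4_0": 4.1, "q8_0": 7.7, "fp16": 14.5}

# Hypothetical machine with 16 GB of available memory.
print(quantizations_that_fit(sizes_gb, available_gb=16))
```

With these assumed numbers the budget is about 10.7 GB, so the two quantized files fit and the fp16 file does not.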

What's the intuition for 2/3 of RAM?
Because there’s always some overhead during inference, and you don’t want to fill all of your available RAM: once you start swapping to disk, everything slows to a crawl.
So why is the overhead a 1/3 ratio instead of a constant amount? Just testing the scaling assumption.
You need some memory left over to hold the context, and that amount grows with the context length rather than staying constant.
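To put a rough number on the context overhead, here is a back-of-the-envelope sketch of the key/value cache size for a standard transformer. The formula ignores activations and runtime buffers, and the example parameters (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) are my assumption of a Mistral-7B-style configuration, not something stated in this thread.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size in bytes for a given context length.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer, per KV head, per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Mistral-7B-style configuration with a 16-bit cache.
per_token = kv_cache_bytes(32, 8, 128, context_len=1)
full = kv_cache_bytes(32, 8, 128, context_len=8192)

print(per_token / 1024, "KiB per token")
print(full / 1024**3, "GiB at 8192 tokens")
```

Under those assumptions the cache scales linearly with context length, which is why a fixed headroom can be too small for long contexts. Whether one-third is the right margin still depends on the model and the context size you actually use.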