Hacker News new | ask | show | jobs
by hskalin 426 days ago
With ollama you could offload a few layers to cpu if they don't fit in the VRAM. This will cost some performance ofcourse but it's much better than the alternative (everything on cpu)
2 comments

I'm doing that with a 12GB card, ollama supports it out of the box.

For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.

Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.

Does it work on LM Studio? Loading 27b-it-qat taking up more than 22GB on 24GB mac.