by wing-_-nuts 755 days ago
The general rule is that VRAM in GB roughly equals parameter count in billions (I'm generalizing about GGUF finetunes here)

8GB vram cards can run 7B models

16GB vram cards can run 13B models

24GB vram cards can run up to 33B models

Now to your question, what can most computers run? You need to look at the tiny but specialized models. I would think 3B models could be run reasonably well even on the CPU. IntelliJ has an absolutely microscopic < 1B model that it uses for code completion locally. It's quite good and I don't notice any delay.


Perhaps there's a simple explanation but why does 24GB of VRAM offer such a large relative uplift in parameter count? (is memory bandwidth a factor rather than just the total memory amount?)
So, this is a bit misleading. For whatever reason, models tend to be released in certain parameter sizes. 7B models are popular. The next size up is 13B. There are few in between (some 11B). Likewise, the jump from 13B is straight to 33B. You can run finetunes of a 33B model that have been cut down a little and fit them in a 24GB card. Likewise, those 13B models running on 16GB cards have a lot of headroom. You don't need to run as cut down a model, and you can run it with more context (i.e. the amount of your chat it can hold in memory)

I hope that helps, it's not 1:1, and it's a bit confusing
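To put rough numbers on "it's not 1:1", here is a back-of-the-envelope sketch of how much memory the weights alone take at different quantization levels. The function name and the flat 16/8/4 bits per weight are assumptions for illustration; real GGUF quant formats store some extra data for scales, and the context and runtime need memory on top of this.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Hypothetical helper: ignores context (KV cache), activations and
    runtime overhead, so treat the result as a lower bound.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Sizes mentioned in the thread, at assumed flat bit widths.
    for size in (7, 13, 33):
        for bits in (16, 8, 4):
            print(f"{size}B @ {bits}-bit: {weight_gib(size, bits):.1f} GiB")
```

Under these assumptions, about 8 bits per weight is what makes "GB of VRAM roughly equals billions of parameters" work for 7B and 13B, while a 33B model only fits in 24GB once it is cut down to somewhere around 4 bits.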

Thank you, that's helpful context.
Probably quantisation.

I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.

My LLM server with a 16GB GPU mainly runs llama3 with an expanded context window, which also costs a lot more memory.
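A rough sketch of why a bigger context window eats memory: the KV cache grows linearly with context length. The layer count, KV head count and head size below are my assumptions for Llama 3 8B (32 layers, 8 KV heads, head size 128), with a 16-bit cache; check your model's config, and note that some runtimes can quantize the cache to shrink it.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence.

    The leading 2 accounts for storing both keys and values per layer.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed Llama 3 8B shape; context lengths are examples only.
    for ctx in (8192, 32768):
        print(f"context {ctx}: {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

With these assumed numbers, going from 8K to 32K context takes the cache from about 1 GiB to about 4 GiB, all of which has to fit next to the weights.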

Yeah, I have a 3090 and 64GB of RAM. I can run an 8x7B and get pretty decent performance out of it with partial offloading.
Really?? For me it's terrible doing that. I also have 64GB RAM but meh. It's so bad when I can no longer offload everything. The tokens literally drizzle in. With full offloading they appear faster than I can read (8B llama3 with 8 bit quant). On a Radeon Pro VII with 16GB (HBM2 memory!)
Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, but badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.
I don't think so, because when I run it on the 4090 I get the same issue (in a system with 5800X3D and 64GB RAM also). I just don't use the 4090 for LLM because I have it for playing VR games and I don't want to tie it up for a 24/7 LLM server :) Also, it's very power-hungry. I do run that one on Windows and the Radeon server is Linux but I don't think that matters a lot. Using the same software stack too (ollama).

In fact the Radeon which cost me only 300 bucks new performs almost as well running LLMs as the 4090 which really surprised me! I think the fast memory (the Radeon has the same 1TB/s memory bandwidth as the 4090!) helps a lot there.

When I try to run a local model (significantly) bigger than the 24GB of VRAM on the 4090, it won't even load after 15 minutes, with the 4090 pegged at 100% the whole time. Eventually I just gave up.
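The memory bandwidth point can be sketched with a crude upper bound: to generate one token, a dense model has to read all of its weights once, so speed is capped at roughly bandwidth divided by model size. The function and the 50 GB/s figure for system RAM are assumptions for illustration (about what dual-channel DDR4 manages); real throughput will be lower, and mixture-of-experts models like an 8x7B only read part of their weights per token, so they behave better than this suggests.

```python
def tokens_per_second(gpu_gb: float, gpu_bw_gbs: float,
                      cpu_gb: float = 0.0, cpu_bw_gbs: float = 50.0) -> float:
    """Crude upper bound on generation speed for a dense model.

    Assumes each token requires one full pass over the weights, with
    the GPU-resident part read at gpu_bw_gbs and the offloaded part
    read from system RAM at cpu_bw_gbs (both in GB/s).
    """
    seconds_per_token = gpu_gb / gpu_bw_gbs + cpu_gb / cpu_bw_gbs
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # 8 GB of weights fully in VRAM at about 1 TB/s.
    print(f"full offload:    {tokens_per_second(8, 1000):.0f} tok/s")
    # Hypothetical 26 GB model: 16 GB in VRAM, 10 GB left in system RAM.
    print(f"partial offload: {tokens_per_second(16, 1000, 10, 50):.1f} tok/s")
```

Under these assumptions the part left in system RAM dominates the time per token, which would explain both why two cards with the same 1TB/s bandwidth feel similar and why tokens drizzle in once the model no longer fits in VRAM.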