


	by noboostforyou 800 days ago
	Perhaps there's a simple explanation but why does 24GB of VRAM offer such a large relative uplift in parameter count? (is memory bandwidth a factor rather than just the total memory amount?)

2 comments

wing-_-nuts 800 days ago

So, this is a bit misleading. For whatever reason, models tend to be released in certain parameter sizes. 7B models are popular. The next size up is 13B, with few in between (some 11B). Likewise, the jump from 13B is straight to 33B. You can run finetunes of a 33B model that have been cut down a little and fit them on a 24GB card. Likewise, those 13B models running on 16GB cards have a lot of headroom. You don't need to cut the model down as much, and you can run it with more context (i.e. the amount of your chat it can hold in memory).

I hope that helps. It's not 1:1, and it's a bit confusing.
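
To put rough numbers on the sizes above, here is a minimal sketch that estimates the memory taken by the weights alone. It assumes that "cut down" means quantized to fewer bits per weight, and it ignores context memory and runtime overhead, so real usage will be higher. The function name `weight_gib` is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Common release sizes from the thread, at 16-, 8- and 4-bit precision.
for params in (7, 13, 33):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: {weight_gib(params, bits):.1f} GiB")
```

Under these assumptions a 33B model only drops below 24GB at around 4 bits per weight, while a 13B model at 8 bits leaves a few GB free on a 16GB card for context.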


noboostforyou 799 days ago

Thank you, that's helpful context.


wkat4242 800 days ago

Probably quantisation.

I own a 4090 and I can only run very heavily quantised 33B models. It's not really worth it.

My LLM server with a 16GB GPU mainly runs llama3 with an expanded context window, which also costs much more memory.
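
As a rough illustration of why a larger context window costs memory, the sketch below estimates the size of the key/value cache, which grows linearly with context length. The model shape used (32 layers, 8 KV heads, head dimension 128) is an assumption matching the published Llama 3 8B configuration, and 2 bytes per value assumes an unquantized fp16 cache. `kv_cache_gib` is a hypothetical helper, not part of any library.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 2**30


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (8192, 32768, 131072):
    print(f"context {ctx}: {kv_cache_gib(32, 8, 128, ctx):.1f} GiB")
```

With these assumed numbers the cache is about 1 GiB at 8K context and grows to 4 GiB at 32K, on top of the weights.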


wing-_-nuts 800 days ago

Yeah, I have a 3090 and 64GB of RAM. I can run an 8x7B and get pretty decent performance out of it with partial offloading.


wkat4242 800 days ago

Really?? For me it's terrible doing that. I also have 64GB RAM, but meh. It's so bad when I can no longer offload everything to the GPU: the tokens literally drizzle in. With full offloading they appear faster than I can read (8B llama3 with 8-bit quant), and that's on a Radeon Pro VII with 16GB (HBM2 memory!).


wing-_-nuts 800 days ago

Oh man, I hate to say it, but it's likely your AMD card. Yes, they can run LLMs and SD, but badly. Larger models are usable for me with partial offloading, but you're right that fully loading the model into VRAM is really preferable.


wkat4242 800 days ago

I don't think so, because when I run it on the 4090 I get the same issue (in a system with a 5800X3D and 64GB RAM as well). I just don't use the 4090 for LLMs because I have it for playing VR games and I don't want to tie it up for a 24/7 LLM server :) Also, it's very power-hungry. I do run that one on Windows and the Radeon server runs Linux, but I don't think that matters a lot. I'm using the same software stack too (ollama).

In fact the Radeon which cost me only 300 bucks new performs almost as well running LLMs as the 4090 which really surprised me! I think the fast memory (the Radeon has the same 1TB/s memory bandwidth as the 4090!) helps a lot there.
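
A back-of-the-envelope way to see why memory bandwidth matters so much: if generating each token requires reading every weight once, then bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes that simplification and ignores compute and cache effects. The 1 TB/s figure comes from the comment above, about 8 GB of weights is an approximation for an 8B model at 8-bit, and the 51.2 GB/s value for dual-channel DDR4-3200 system RAM is an assumption that does not appear in the thread.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


model_gb = 8.0  # roughly an 8B model at 8 bits per weight (weights only)

# Bandwidth figures in GB/s; the system RAM value is an assumed example.
for name, bandwidth in (("GPU at ~1 TB/s", 1000.0),
                        ("dual-channel DDR4-3200 (assumed)", 51.2)):
    bound = max_tokens_per_second(model_gb, bandwidth)
    print(f"{name}: at most {bound:.1f} tokens/s")
```

Under these assumptions two cards with similar bandwidth land on a similar ceiling regardless of price, while any part of the model served from system RAM has a ceiling roughly twenty times lower.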

When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.


wing-_-nuts 800 days ago

>When I run a local model (significantly) bigger than the 24GB VRAM on the 4090 it won't even load for 15 minutes while the 4090 is pegged at 100% all the time. Eventually I just gave up.

Yeah, the key here is partial offloading. If you're trying to offload more layers than your GPU has memory for, you're gonna have a bad time. I find it infuriating that this is still kind of a black art. There's definitely room for better tooling here.

Regardless, with 24GB of VRAM, I try to limit my offloading to 20GB and let the rest go to RAM. Maybe it's the nature of the 8x7B model I run that makes it better at offloading than other large models. I'm not sure. I wouldn't try the 70B models for sure.
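
The 20GB rule of thumb can be sketched as a small calculation: pick how many layers go to the GPU so their total size stays under a VRAM budget, and leave the rest in system RAM. This assumes all layers are the same size, which is only approximately true, and the 26 GiB / 32-layer model below is a made-up example rather than a measurement. `split_layers` is a hypothetical helper; in llama.cpp the resulting number would be passed as `--n-gpu-layers`.

```python
import math


def split_layers(model_gib: float, n_layers: int,
                 vram_budget_gib: float) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers), assuming equally sized layers."""
    per_layer_gib = model_gib / n_layers
    # Never offload more layers than fit in the budget or than exist.
    gpu_layers = min(n_layers, math.floor(vram_budget_gib / per_layer_gib))
    return gpu_layers, n_layers - gpu_layers


# Hypothetical 26 GiB model with 32 layers, 20 GiB budget on a 24GB card.
gpu, cpu = split_layers(model_gib=26.0, n_layers=32, vram_budget_gib=20.0)
print(f"offload {gpu} layers to the GPU, keep {cpu} in system RAM")
```

Leaving a few GB of the card unallocated, as in the comment above, gives room for the context cache and whatever else is using the GPU.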
