by fc417fc802 114 days ago
Ideally you'd have (parameter count) * (bits per parameter) / 8 bytes of VRAM, enough to hold the entire model (presumably quantized; don't forget to account for that). So very approximately 16 GiB for a 34B model quantized to 4 bits per parameter.
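To make the arithmetic above concrete, here is a minimal sketch of the estimate. The function names are made up for illustration, and it counts the weights only, so anything else the runtime keeps in VRAM comes on top.

```python
def model_weight_bytes(param_count: float, bits_per_param: float) -> float:
    """Bytes needed for the weights alone: parameters * bits each, 8 bits per byte."""
    return param_count * bits_per_param / 8


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# 34B parameters quantized to 4 bits per parameter
weights = model_weight_bytes(34e9, 4)
print(f"{to_gib(weights):.1f} GiB")  # prints "15.8 GiB"
```

That is 17 GB in decimal units, which is where the "very approximately 16 GiB" figure comes from.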

You can spill to RAM, in which case you want at least enough VRAM for a single active expert, but really that's going to tank performance. If you're only "a bit" short of the full model, the difference might not be all that large.

These things are memory-bandwidth limited, so if you compare RAM, VRAM, and PCIe bandwidth, what I wrote above should make sense.
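A rough way to play with the bandwidth argument, as a sketch only: assume a dense model where every weight is read once per generated token, and that the time per token is simply bytes divided by bandwidth for each place the weights live. The function name and all the numbers below are placeholders I picked for illustration, not measurements of any real hardware; plug in the figures for your own RAM, VRAM, and PCIe link.

```python
def min_seconds_per_token(bytes_in_vram: float, bytes_spilled: float,
                          vram_bytes_per_s: float, spill_bytes_per_s: float) -> float:
    """Lower bound on time per token, assuming every weight is read once per token."""
    return bytes_in_vram / vram_bytes_per_s + bytes_spilled / spill_bytes_per_s


# Placeholder numbers for illustration only, not real hardware specs.
total = 17e9        # weight bytes (34B parameters at 4 bits each)
vram_bw = 1000e9    # hypothetical VRAM bandwidth, bytes/s
spill_bw = 50e9     # hypothetical bandwidth to the spilled weights, bytes/s

for spilled_fraction in (0.0, 0.05, 0.25):
    t = min_seconds_per_token(total * (1 - spilled_fraction),
                              total * spilled_fraction,
                              vram_bw, spill_bw)
    print(f"{spilled_fraction:.0%} spilled: at most {1 / t:.1f} tokens/s")
```

This ignores compute, caching, and mixture-of-experts routing (where only the active experts are read), so treat the output as an upper bound on throughput rather than a prediction.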

Also you should just ask your friendly local LLM these sorts of questions.

1 comment

I usually do ask the LLM what parameters to use. But that’s why I know so little about parameters!