HN Mirror

by skirmish 605 days ago

Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed: [1]

    I have a 4090, PCIe 3x16, DDR4 RAM.
    
    oobabooga/text-generation-webui
    using exllama
    I can load 30B 4bit GPTQ models and use full 2048 context
    I get 30-40 tokens/s

[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...

1 comment

treprinum 605 days ago

Quantized, sure, but there is some loss of variability in the output that one can notice quickly with 30B models. If you want to use the fp16 version, you are out of luck.
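
For a rough sense of why 4-bit fits in 24GB and fp16 does not, here is a back-of-the-envelope sketch. It assumes a nominal 30e9 parameters and counts only the weights, ignoring the KV cache, activations, and quantization metadata, so real usage will be somewhat higher. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 30e9  # assumption: nominal "30B" parameter count
VRAM_GB = 24     # e.g. a single 24GB card such as the 4090 mentioned above

for label, bits in [("4-bit GPTQ", 4), ("fp16", 16)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    fits = gb < VRAM_GB
    print(f"{label}: ~{gb:.0f} GB of weights, fits in {VRAM_GB} GB: {fits}")
```

Under these assumptions the 4-bit weights come to about 15 GB, leaving headroom for context, while fp16 needs about 60 GB for the weights alone.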
