by skirmish
558 days ago
Quantized 30B models should run in 24GB VRAM. A quick search found people doing that with good speed [1]:

> I have a 4090, PCIe 3x16, DDR4 RAM.
> oobabooga/text-generation-webui
> using exllama
> I can load 30B 4bit GPTQ models and use full 2048 context
> I get 30-40 tokens/s

[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...
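
For a rough sanity check of the 24GB figure, here is a back-of-the-envelope estimate in Python. It is only a sketch, and none of the numbers come from the linked thread. It assumes the "30B" model is the LLaMA one (about 32.5B parameters, 60 layers, hidden size 6656), that the KV cache is kept in fp16, and it ignores GPTQ scale/zero-point overhead, activations and framework buffers, so real usage will be somewhat higher. `estimate_vram_gib` is a made-up helper name.

```python
def estimate_vram_gib(n_params, bits_per_weight, n_layers, hidden_size,
                      context_len, kv_bytes=2):
    """Rough VRAM estimate in GiB: quantized weights plus KV cache only."""
    # Quantized weights: each parameter takes bits_per_weight / 8 bytes.
    weight_bytes = n_params * bits_per_weight / 8
    # KV cache: one key and one value vector of hidden_size per layer per token.
    kv_cache_bytes = 2 * n_layers * hidden_size * context_len * kv_bytes
    return (weight_bytes + kv_cache_bytes) / 2**30


# Assumed LLaMA "30B" shape: ~32.5B params, 60 layers, hidden size 6656.
total = estimate_vram_gib(
    n_params=32.5e9,
    bits_per_weight=4,
    n_layers=60,
    hidden_size=6656,
    context_len=2048,
)
print(f"estimated VRAM: {total:.1f} GiB")
```

Under those assumptions the estimate lands well under 24 GiB, which fits with the report of loading a 4bit 30B model with the full 2048 context on a 4090.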