by orost
1121 days ago
You can just barely fit a 33B GPTQ model in 24GB of VRAM. It has to be quantized to 4 bits and you won't get the maximum context size, but it will be quite fast. Alternatively, you can run from RAM+VRAM in GGML format with llama.cpp (or a derivative), which will easily fit 65B models even at 5 or 8 bits, but at much lower speed.
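
For a rough sense of why those numbers work out, here is a back-of-the-envelope sketch in Python. The helper name `weight_gib` is made up for illustration. It assumes the parameter counts are exactly the nominal 33B and 65B, and it counts only the quantized weights. Real GPTQ and GGML files also carry per-group scales and other metadata, and inference needs extra memory for the KV cache and activations, so treat the results as lower bounds.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits;
    quantization metadata, KV cache, and activations are ignored.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    cases = [("33B @ 4-bit", 33, 4), ("65B @ 5-bit", 65, 5), ("65B @ 8-bit", 65, 8)]
    for name, params, bits in cases:
        print(f"{name}: ~{weight_gib(params, bits):.1f} GiB of weights")
```

Under these assumptions the 33B model at 4 bits comes to a bit over 15 GiB of weights, which leaves only a few GiB on a 24GB card for context. The 65B cases land well above 24GB, which is why they need system RAM as well.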