Hacker News
by orost 1121 days ago
You can just barely fit a 33B GPTQ model in 24 GB of VRAM. It has to be quantized to 4 bits and you won't get the maximum context length, but it will be quite fast. Alternatively, you can split the model across RAM and VRAM using the GGML format with llama.cpp (or a derivative), which will easily fit 65B models even at 5-bit or 8-bit quantization, but at much lower speed.
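
For a rough sense of where these numbers come from, here is a back-of-the-envelope sketch. It assumes the nominal parameter counts (33 and 65 billion) and counts only the quantized weights; the helper name `weight_gb` is made up for illustration, and real GPTQ/GGML files are somewhat larger because of per-group scales, and the context cache and activations need memory on top of that.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes).

    Counts only the quantized weights. Quantization metadata, the
    context (KV) cache, and activations are deliberately ignored.
    """
    # billions of parameters * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


# Nominal model sizes and bit widths mentioned in the comment above.
for params, bits in [(33, 4), (65, 5), (65, 8)]:
    print(f"{params}B at {bits}-bit: about {weight_gb(params, bits):.1f} GB of weights")
```

By this estimate a 33B model at 4 bits needs about 16.5 GB for weights alone, so what is left of 24 GB has to cover the overhead and the context, which is why the full context does not fit. A 65B model at 5 or 8 bits is well past 24 GB on weights alone, so part of it has to sit in system RAM.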