Hacker News
by tildef 933 days ago
Loading it with 4-bit quantization takes 22 GB of VRAM in total for me, including whatever Xorg is using (I'm not really trying to eke out any extra megabytes here). As such, the 8-bit version is probably not for people on consumer GPUs. Inference speed on a 4090 is about the same as the web version of ChatGPT with GPT-4. The generated output at 4-bit is good so far (no worse than Llama 2 at least), though I haven't really put it through its paces.
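A rough back-of-the-envelope sketch of why 8-bit looks out of reach on a 24 GB card like the 4090. It assumes the weights dominate the footprint and scale linearly with bit width. The 2 GB of non-weight overhead (Xorg, CUDA context, and so on) is a guess for illustration, not a measurement, and `scaled_footprint_gb` is a made-up helper name.

```python
def scaled_footprint_gb(observed_gb: float, overhead_gb: float,
                        from_bits: int, to_bits: int) -> float:
    """Estimate total VRAM at another bit width from one observed total.

    Only the weight portion (observed minus overhead) is scaled;
    the overhead is assumed to stay constant.
    """
    weights_gb = observed_gb - overhead_gb
    return weights_gb * to_bits / from_bits + overhead_gb


OBSERVED_4BIT_GB = 22.0  # total reported above at 4-bit
ASSUMED_OVERHEAD_GB = 2.0  # assumption: desktop + runtime overhead
CARD_GB = 24.0  # VRAM on an RTX 4090

estimate_8bit = scaled_footprint_gb(OBSERVED_4BIT_GB, ASSUMED_OVERHEAD_GB, 4, 8)
print(f"estimated 8-bit total: ~{estimate_8bit:.0f} GB (card has {CARD_GB:.0f} GB)")
print("fits" if estimate_8bit <= CARD_GB else "does not fit")
```

Even with a much larger overhead guess, the 8-bit estimate stays well above 24 GB, so the conclusion doesn't depend much on that number.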