Hacker News
by Stagnant 2 hours ago
I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks, through opencode and llama-server's front end. The past couple of weeks have made a big difference: Google released the QAT variant and llama.cpp got support for MTP, which makes it possible to get 60-80 tok/s on an RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and running multiple other programs.
2 comments

Another data point: 110+ tok/s on the RTX 5090 (Gemma 4 31B QAT + MTP at UD-Q4_K_XL), with VRAM usage peaking at 27 GB.

The really lovely thing was getting 300+ tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL). At peak, I think I saw VRAM usage reach 21 GB.

The problem with that setup is that it will run out of context pretty quickly, so for a coding agent it will limit your workflow very fast.
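
To put a rough number on the context limit, here is a back-of-the-envelope sketch in Python. It assumes plain full attention with an fp16 KV cache, where each token stores one K and one V vector per layer. The layer count, KV head count and head size below are placeholders, not Gemma 4's real configuration, and the 27 GB "used" figure is borrowed from the peak reported above, which probably already includes some KV cache. Sliding-window layers, KV cache quantization and MTP overhead are all ignored, so treat the result as an order-of-magnitude guess only.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token for plain full attention.

    Each layer stores one K and one V vector of size n_kv_heads * head_dim.
    bytes_per_elem=2 corresponds to an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_tokens(total_vram_gb: float, used_vram_gb: float,
                       per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the VRAM left over after everything else."""
    headroom_bytes = max(0.0, total_vram_gb - used_vram_gb) * 1024**3
    return int(headroom_bytes // per_token_bytes)


if __name__ == "__main__":
    # Hypothetical architecture numbers, for illustration only.
    per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)
    # 32 GB card with 27 GB already in use (illustrative).
    tokens = max_context_tokens(32, 27, per_token)
    print(f"{per_token / 1024:.0f} KiB per token, about {tokens} tokens of headroom")
```

With numbers in that range, a few GB of headroom only buys a few tens of thousands of tokens, which a coding agent can burn through quickly once it starts reading files and tool output.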