|
|
|
|
|
by Stagnant
2 hours ago
|
|
I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs. |
|
The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)