Hacker News
by linuxhansl 4 days ago
This is really awesome. I tried my test task (generating a Python wrapper for a fairly complex C interface) with gemma4:26b-a4b-it-qat, and for the first time it just did it, without prodding and without errors. (Of course, Claude did it in a heartbeat, too.)

My optimal local setup is now gemma4-qat with Q8_0 K/V cache quantization and a 256k context window. That runs fine with 12GB of VRAM plus another 10GB of RAM.
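
For a rough sense of why Q8_0 K/V cache quantization matters at a 256k context, here is a back-of-the-envelope sketch. The layer count, K/V head count, and head dimension are made-up placeholders, not gemma4's real architecture, and the estimate ignores sliding-window or other cache-saving attention layers. The only firm number is the Q8_0 block layout: 32 int8 values plus one fp16 scale, so 34 bytes per 32 values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Estimate K/V cache size for a plain full-attention transformer."""
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bytes_per_value

F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one fp16 scale per block

# Hypothetical architecture, for illustration only (NOT gemma4's real values).
N_LAYERS = 32
N_KV_HEADS = 4
HEAD_DIM = 128
CONTEXT_LEN = 256 * 1024  # 256k tokens

f16_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, F16_BYTES_PER_VALUE)
q8_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, Q8_0_BYTES_PER_VALUE)

GIB = 2 ** 30
print(f"f16 K/V cache:  {f16_bytes / GIB:.1f} GiB")  # 16.0 GiB
print(f"Q8_0 K/V cache: {q8_bytes / GIB:.1f} GiB")   # 8.5 GiB
```

With these placeholder numbers, Q8_0 cuts the cache to a bit over half of the f16 size, which is the kind of saving that decides whether a long context fits next to the weights in 12GB of VRAM plus system RAM.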

Previously I tried gemma4:26b-a4b-it-q4_K_M and qwen3.6:35b-a3b-q4_K_M, and both would tie themselves in knots (qwen3.6 especially can take forever with incessant "but wait..." thinking loops). More often than not, they would not finish the task.

It seems to be true that these 4-bit QAT models are as precise as Q8_0 quantization (which is supposedly indistinguishable from bf16).

I am really excited about the prospect of local LLM inference.