But as models start packing more information into fewer bits, some weights are going to end up being super important and very sensitive to quantization. So I'd just move down a Q size and stick with K_XL. I'm betting Q3_K_XL will beat Q4_K_M on any given model in real-world testing, even though it's ~20% smaller, while doing worse on benchmaxxing.
The only exception I can think of is small models. In my testing on Gemma E2B/E4B and Qwen 3.5 9B, quantizing at all was super noticeable... they can't spread the error across as many weights.
Good news (at least for me): 24GB of VRAM is enough to hold either of those in BF16 and still leave a ton of room for an F16/F16 KV cache.
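If you want to sanity-check that kind of VRAM math yourself, here's a rough back-of-the-envelope sketch. It assumes 2 bytes per weight for BF16, 2 bytes per element for F16 keys and values, and a plain attention KV cache; the layer count, KV head count, head dim, and context length in the example are made-up placeholders, not the real config of any of the models above. It also ignores activations and runtime overhead, so treat the result as a lower bound.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bytes_per_weight: float = 2.0) -> float:
    """Approximate weight memory; 2 bytes per weight corresponds to BF16/F16."""
    return n_params * bytes_per_weight / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size for a standard attention cache.

    The leading 2 accounts for storing both keys and values;
    2 bytes per element corresponds to an F16/F16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB


def fits(budget_gib: float, n_params: float, **kv_config) -> bool:
    """True if weights plus KV cache fit in the budget (overhead not counted)."""
    return weights_gib(n_params) + kv_cache_gib(**kv_config) <= budget_gib


if __name__ == "__main__":
    # Hypothetical 9B-parameter model; every config value below is a placeholder.
    cfg = dict(n_layers=36, n_kv_heads=8, head_dim=128, context_len=32768)
    print(f"weights:  {weights_gib(9e9):.2f} GiB")
    print(f"kv cache: {kv_cache_gib(**cfg):.2f} GiB")
    print("fits in 24 GiB:", fits(24.0, 9e9, **cfg))
```

Swap in the numbers from the model's actual config file to get a real estimate; models with sliding-window or other non-standard attention layers will need less cache than this formula suggests.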