by moffkalast
462 days ago
Gemma 2 27B at 4 bits would be a drooling idiot anyway; even going down to 8 bits seems to significantly lobotomize it. Qwens are surprisingly resistant to quantization compared to most, so Qwen will pull ahead on that alone in terms of coherence for the same amount of VRAM. We'll see if the quantization-aware versions are any better this time around, but I doubt any inference framework will even support them. Gemma.cpp never got a standard-compatible server API so people could actually use it, and as a result got absolutely zero adoption.
|
All of the above is subjective, so maybe that’s true for you, but claiming there’s a lack of inference framework support for Gemma 2 is really off the mark.
Obviously ollama supports it. So does llama.cpp. So does mlx. That’s 3 frameworks that support quantized versions of Gemma 2.
llama.cpp support for Gemma 3 is out; the PR was merged a couple of hours after Google’s announcement. Obviously ollama supports it as well, as you can see in TFA here.
I’m really curious how you reached those conclusions. Are we living in alternate universes?