
I don't think ollama quantizes the embedding table; as far as I can tell it's still stored in full FP16.
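For a rough sense of why that matters, here's a back-of-the-envelope sketch. The table dimensions below are my assumptions for illustration (check the model's config for the real values), and the 4.5 bits/weight figure comes from the Q4_0 block layout: 32 4-bit weights plus one FP16 scale per block. The function name is made up for this example.

```python
def table_bytes(rows: int, cols: int, bits_per_weight: float) -> float:
    """Storage needed for a rows x cols weight table at a given bit width."""
    return rows * cols * bits_per_weight / 8


# Assumed dimensions, for illustration only; substitute the real
# vocab size and hidden size from the model's config.
VOCAB_SIZE = 262_144
HIDDEN_SIZE = 5_376

FP16_BITS = 16.0
# Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values
# plus a 2-byte FP16 scale), i.e. 18 * 8 / 32 = 4.5 bits per weight.
Q4_0_BITS = 18 * 8 / 32

fp16_size = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, FP16_BITS)
q4_0_size = table_bytes(VOCAB_SIZE, HIDDEN_SIZE, Q4_0_BITS)

GIB = 1024 ** 3
print(f"FP16 embedding table: {fp16_size / GIB:.2f} GiB")
print(f"Q4_0 embedding table: {q4_0_size / GIB:.2f} GiB")
print(f"Difference:           {(fp16_size - q4_0_size) / GIB:.2f} GiB")
```

Under those assumed dimensions, leaving the table in FP16 costs a couple of GiB over a 4-bit version, which is a noticeable chunk of a 4-bit 27B download.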

If you're using MLX, that means you're on a Mac, in which case ollama actually isn't your best option. Use llama.cpp directly if you're a power user, or use LM Studio if you want something a bit better than ollama but more user-friendly than llama.cpp. (LM Studio has a GUI and is also more user-friendly than ollama, with the downside of not being as scriptable. You win some, you lose some.)

Don't use MLX: it's currently not as fast or as small as the best GGUFs, and it also tends to be buggier (it has some known bugs with Japanese right now). Download the LM Studio version of the Gemma 3 QAT GGUF quants, which are made by Bartowski. Google directly mentions Bartowski in the blog post linked above (ctrl-f his name), and his models are currently the best ones to use.

https://huggingface.co/bartowski/google_gemma-3-27b-it-qat-G...

The "best Gemma 3 27B model to download" crown has taken a very roundabout path. After the initial Google release, it went from Unsloth Q4_K_M, to Google QAT Q4_0, to stduhpf Q4_0_S, and now to Bartowski Q4_0.