Hacker News
by WhiteDawn 6 days ago
Once someone generates an MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5-year-old GPU.

Models:

- Safetensors: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...

- GGUF: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/...

Note the README in the Unsloth list of files. llama.cpp is working on a PR to support the gemma4 drafters: https://github.com/ggml-org/llama.cpp/pull/23398. Also note that the PR submitter didn't see much speedup with 26B (it seems typical that MoE models don't benefit much from MTP).
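
For anyone wondering why a drafter might help an MoE model less, here is a rough back-of-envelope sketch using the standard speculative-decoding estimate. It assumes each drafted token is accepted independently with the same probability, and it measures costs in units of one ordinary decode step of the big model. The function names and all the numbers are made up for illustration and are not measurements from the PR. The one modelling assumption that matters is `verify_cost`: a commonly given explanation is that verifying several tokens at once makes an MoE model touch more experts, so that pass costs noticeably more than a single decode step.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the target model.

    Assumes each drafted token is accepted independently with probability
    accept_rate. A pass always yields at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost: float, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (one token per unit of cost).

    draft_cost:  cost of one drafter step, relative to one target decode step.
    verify_cost: cost of one verification pass, relative to one target decode
                 step (hypothetically larger for MoE models).
    """
    cost_per_pass = draft_len * draft_cost + verify_cost
    return expected_tokens_per_pass(accept_rate, draft_len) / cost_per_pass


if __name__ == "__main__":
    # Illustrative inputs only, not benchmarks.
    dense_like = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=1.0)
    moe_like = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=2.5)
    print(f"dense-like: {dense_like:.2f}x, MoE-like: {moe_like:.2f}x")
```

With the same acceptance rate, the estimate drops quickly once the verification pass stops being roughly free, which would be consistent with the small speedup reported for 26B.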

This is safetensors. Is there any way to run these on a Mac paired with the MLX QAT?

(Pardon my ignorance; this stuff moves so fast)

Did you see this?

https://point.free/blog/gemma-4-on-a-2016-xeon/

It's on a Xeon, but it could be useful for MTP on a Mac.

To briefly follow up: as of yesterday llama.cpp can do Gemma 4's MTP, so I have this working, at least initially. Details here:

https://news.ycombinator.com/item?id=48441450

I hadn't seen this, thanks.

I do have the Qwen 3.6 (35B) MTP implementation running (in LM Studio; it doesn't need a separate drafter), along with non-MTP Gemma 4 26B, and I can see that Unsloth Studio can run the new QAT, but I can't see how you can run the assistant/drafter. Yet.

It's just a constantly changing landscape. Don't get me wrong, it's fascinating and for various reasons I am pleased I can keep up even slightly, but eeeehhh :-)

Yeah, that is the base QAT model, and there are safetensors weights for the QAT version of the MTP drafter, but there are no MLX/GGUF versions. I think the answer is a combination of:

1) Gemma 4 MTP is too fresh for off-the-shelf software to use anyway

2) "you can convert them yourself" which is fine, obvs
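
On the "convert them yourself" point, a minimal sketch of what that could look like for GGUF, assuming a local llama.cpp checkout and using its `convert_hf_to_gguf.py` script. The model directory and output filename are placeholders, and the helper name is hypothetical. Whether the converter recognizes the Gemma 4 drafter architecture yet is not something this thread confirms, so treat it as a starting point rather than a recipe.

```python
import shlex
from pathlib import Path


def build_gguf_convert_command(model_dir: str, out_file: str,
                               out_type: str = "q8_0",
                               llama_cpp_dir: str = "llama.cpp") -> list[str]:
    """Build the argument list for llama.cpp's HF-to-GGUF converter.

    model_dir is a local directory holding the downloaded safetensors
    checkpoint (placeholder here). Nothing is executed; the caller decides
    whether to run the command.
    """
    script = Path(llama_cpp_dir) / "convert_hf_to_gguf.py"
    return [
        "python", str(script), model_dir,
        "--outfile", out_file,
        "--outtype", out_type,
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute wherever the drafter weights were saved.
    cmd = build_gguf_convert_command("path/to/drafter-safetensors", "drafter.gguf")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the paths before committing to a long conversion run.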