by dofm
10 days ago
llama.cpp has been patched to support Gemma 4 MTP, so thanks to https://huggingface.co/RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf you can now do this:

  ./llama-server \
    -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
    --spec-draft-hf RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf:Q4_0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3

It's really quick on an M1 Max, and I'm definitely seeing a speedup. Interestingly, though, oMLX performance with the MLX community MTP models is much worse. I'm not really sure why (my guess is that the parallel model overhead isn't worth the speculative gain, but I don't understand this stuff anywhere near as well as I'd like).
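
For what it's worth, here is a rough way to reason about that overhead-versus-gain guess. It is a sketch of the usual textbook speculative decoding estimate, not a description of what llama.cpp or oMLX actually do internally. It assumes every drafted token is accepted independently with the same probability (`alpha`), that checking a whole draft costs about one normal target-model step, and that each draft token costs `draft_cost` target steps. The function names and every number in it are made up for illustration, not measurements.

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens produced per draft-and-verify cycle.

    alpha: assumed per-token acceptance probability (hypothetical, 0..1)
    gamma: number of drafted tokens per cycle (3 mirrors --spec-draft-n-max 3)
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus one token from the verify pass.
        return float(gamma + 1)
    # Geometric series 1 + alpha + ... + alpha**gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (one token per target step).

    draft_cost:  assumed cost of one draft token, in units of a target step
    verify_cost: assumed cost of verifying the draft, in units of a target step
    """
    cycle_cost = gamma * draft_cost + verify_cost
    return expected_tokens_per_cycle(alpha, gamma) / cycle_cost


if __name__ == "__main__":
    # Illustrative settings only: (acceptance rate, relative draft cost)
    for alpha, draft_cost in [(0.8, 0.05), (0.8, 0.4), (0.5, 0.4)]:
        speedup = estimated_speedup(alpha, gamma=3, draft_cost=draft_cost)
        print(f"alpha={alpha} draft_cost={draft_cost} speedup~{speedup:.2f}x")
```

Under those assumptions, a cheap draft with a high acceptance rate comes out well above 1x, while an expensive draft with a mediocre acceptance rate drops below 1x, i.e. slower than not speculating at all. That would be consistent with the overhead eating the gain in one runtime but not the other, though it does not show that this is what is happening with oMLX.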