by dofm 10 days ago
llama.cpp has been patched to support Gemma 4 MTP, so thanks to:

  https://huggingface.co/RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf
you can now do this:

  ./llama-server \
      -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
      --spec-draft-hf RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf:Q4_0 \
      --spec-type draft-mtp \
      --spec-draft-n-max 3
Really quick on an M1 Max, and I am definitely seeing a speedup over running without the draft model. Interestingly, though, oMLX performance with the MLX community MTP models is much worse. I am not really sure why. My guess is that the overhead of running the draft model alongside the main one outweighs the speculative gain there, but I do not understand this stuff anywhere near as well as I would like.
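To make the "overhead versus speculative gain" guess a bit more concrete, here is a back-of-the-envelope sketch in Python. It uses the standard speculative-decoding estimate and assumes every drafted token is accepted independently with the same probability, which real models only approximate. The names `alpha` (acceptance rate) and `draft_cost` (cost of one draft step relative to one pass of the main model) are mine, and the numbers in the loop are purely illustrative, not measurements from llama.cpp or oMLX. The value k = 3 mirrors --spec-draft-n-max 3 above.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each of the k drafted tokens is accepted independently
    with probability alpha, and the main model always contributes one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the (hypothetical) cost of one draft step, expressed as a
    fraction of one main-model pass. A result below 1.0 means a slowdown.
    """
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


if __name__ == "__main__":
    k = 3  # same draft length as --spec-draft-n-max 3
    for alpha in (0.5, 0.7, 0.9):        # illustrative acceptance rates
        for draft_cost in (0.05, 0.3, 0.6):  # illustrative relative costs
            s = estimated_speedup(alpha, k, draft_cost)
            print(f"alpha={alpha:.1f} draft_cost={draft_cost:.2f} speedup={s:.2f}x")
```

Under these assumptions a cheap draft step with a high acceptance rate gives a solid win, while an expensive draft step or a low acceptance rate can push the estimate below 1.0, which would match a setup that ends up slower with MTP enabled. Whether that is what actually happens with the MLX models I cannot say.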