Just this guy's assistant running against the official Q4_0 GGUF:

  ./llama-server \
    -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
    --spec-draft-hf RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf:Q4_0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3

I hadn't done any really rigorous testing, so I've just had another look.

Without the MTP drafter, it is pretty consistently 75 tokens per second anyway, which is interesting.

With the MTP drafter it starts well above 95 tokens per second while handling the prompt, then slowly drops to 65 or so over the output tokens as the prediction success rate falls.

With generated output, though, it seems to me that the prediction success rate is always going to drop dramatically over time.
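To put rough numbers on why throughput tracks the prediction success rate, here is a minimal sketch of the usual back-of-envelope model for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification and not something measured here. The `step_cost` parameter (cost of one draft-plus-verify step relative to a plain decode step) is a hypothetical knob, not a value from llama.cpp, and the 1.3 default is only a placeholder.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the n_draft drafted tokens is accepted independently
    with probability accept_rate (a simplifying assumption).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        # Every draft accepted, plus the one token the target always yields.
        return n_draft + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^n_draft
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_tps(baseline_tps: float, accept_rate: float, n_draft: int,
                  step_cost: float = 1.3) -> float:
    """Rough tokens/sec estimate with drafting enabled.

    step_cost is a hypothetical overhead factor: how much longer one
    draft+verify step takes than one plain decode step.
    """
    return baseline_tps * expected_tokens_per_step(accept_rate, n_draft) / step_cost


if __name__ == "__main__":
    # 75 tok/s is the no-drafter figure from the post; n_draft=3 matches
    # --spec-draft-n-max 3. The acceptance rates are illustrative only.
    for rate in (0.2, 0.4, 0.6, 0.8):
        print(f"accept={rate:.1f}  est. tok/s={estimated_tps(75.0, rate, 3):.1f}")
```

Under this model a falling acceptance rate can push the drafted configuration below the plain 75 tokens per second, because the draft-and-verify overhead is paid on every step whether or not the drafts are accepted.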

I think my results here are broadly consistent with what people say about success rates with smaller and sparse models. I am going to test with --spec-draft-n-max 4 in agentic situations at some point, and I may see whether it has much impact on the 31B model, which is too slow to be practical otherwise.

I have a very unqualified feeling that MTP will matter more in agentic coding because of the larger prompts.

But my biggest issue since I installed it, I think, is that the combination is occasionally messing with markdown generation during thinking, and sometimes possibly losing the </think> at the end. I've seen it enough now to be fairly sure it is the Gemma MTP causing it. There is an open bug in the vLLM project about this and I wonder if something similar is going on in llama.cpp.

The speed without the MTP drafter is pretty solid so I am content to let more experienced people than me handle things while I learn other stuff, but I might go looking for some testing code that can prove it sometime.
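As a starting point for that kind of testing code, here is a minimal sketch of a checker that flags the two symptoms described above: a missing closing think tag and broken markdown fencing. It assumes you can get the raw model output with literal `<think>` and `</think>` markers in it; if the server splits reasoning into a separate field, the check would need adapting. The function name and the tag defaults are hypothetical, not part of llama.cpp.

```python
import re

FENCE = "`" * 3  # markdown code fence, built this way to avoid a literal fence here


def check_thinking_output(raw: str,
                          open_tag: str = "<think>",
                          close_tag: str = "</think>") -> list[str]:
    """Return a list of problems found in one raw completion.

    The tag names are assumptions about the raw output format.
    """
    problems = []

    # Symptom 1: a thinking block that was opened but never closed (or vice versa).
    opens = raw.count(open_tag)
    closes = raw.count(close_tag)
    if opens != closes:
        problems.append(f"unbalanced think tags: {opens} open, {closes} close")

    # Symptom 2: an odd number of fence lines means a code block was left open.
    fence_lines = re.findall(r"^[ \t]*" + re.escape(FENCE), raw, flags=re.MULTILINE)
    if len(fence_lines) % 2 != 0:
        problems.append(f"odd number of code fences: {len(fence_lines)}")

    return problems
```

Running the same set of prompts with and without the drafter, at a fixed seed and temperature, and comparing how often this returns a non-empty list would give a rough signal on whether the drafter is involved. It would not prove causation on its own.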