by doctorpangloss 1 day ago
look... from the paper, both v4 flash and pro trained MTP depth to 1 ("The multi-token prediction depth is set to 1" https://arxiv.org/pdf/2606.19348v1#subsection.2.1 pg 25). it doesn't predict the next 2 tokens. the verifier is the whole model. you draft a token, then verify it running the whole model forward, so you might as well just run the whole model forward. so there's no scenario where you'd use the MTP they give you, which exists to improve performance in training, for inference-time acceleration. you can do something else. alternatively, by all means, see for yourself. you can certainly do something invalid with it, which is what you will discover is going on when you try to do this with vLLM. make sure to reply with a pirate accent. so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers, what can i say? it's just limited.
1 comment

https://developer.nvidia.com/blog/an-introduction-to-specula...

You draft n tokens, and you verify them in a single forward pass.
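To make that concrete, here is a minimal sketch of greedy draft verification. It is a toy under stated assumptions, not vLLM's implementation: `verify_draft`, `target_next_tokens`, and `toy_target` are hypothetical names, and `target_next_tokens` stands in for one forward pass of the full model that returns its greedy next token at every position.

```python
from typing import Callable, List


def verify_draft(
    target_next_tokens: Callable[[List[int]], List[int]],
    context: List[int],
    draft: List[int],
) -> List[int]:
    """Greedy verification of drafted tokens (hypothetical helper).

    target_next_tokens(seq) models ONE forward pass of the full model: for
    every position i it returns the model's greedy token after seq[:i + 1].
    context must be non-empty.
    """
    seq = context + draft
    preds = target_next_tokens(seq)  # one pass scores every drafted position
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        # What the full model would have emitted in the slot this draft token fills.
        expected = preds[len(context) - 1 + i]
        if tok != expected:
            accepted.append(expected)  # keep the model's token, drop the rest
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the same pass also yields one extra token.
    accepted.append(preds[len(seq) - 1])
    return accepted


def toy_target(seq: List[int]) -> List[int]:
    # Assumed toy "model": the next token is always the previous token + 1.
    return [t + 1 for t in seq]


print(verify_draft(toy_target, [1, 2, 3], [4, 5]))  # both draft tokens match
print(verify_draft(toy_target, [1, 2, 3], [4, 9]))  # second draft token is wrong
```

Even when a draft token is rejected, the pass still produces the full model's own token at that position, so a verification pass never yields fewer tokens than ordinary one-token decoding.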

Here's the vLLM flag:

    --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good job of successfully predicting that second token, about 60-80% of the time.
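For a rough sense of what that acceptance rate buys, here is a back-of-the-envelope sketch. It assumes verification stops at the first rejected draft token; the 0.85 first-token rate is a made-up placeholder, not a number from the paper or this thread, and `expected_tokens_per_pass` is a hypothetical helper.

```python
from typing import Sequence


def expected_tokens_per_pass(accept_rates: Sequence[float]) -> float:
    """Expected tokens emitted per full-model forward pass.

    accept_rates[i] is the chance draft token i is accepted, given that all
    earlier draft tokens were accepted. Verification stops at the first miss.
    """
    expected = 1.0  # the full model always contributes one token per pass
    survive = 1.0   # probability that every draft token so far was accepted
    for rate in accept_rates:
        survive *= rate
        expected += survive
    return expected


# 0.85 for the first draft token is an assumption; 0.6 and 0.8 bracket the
# second-token range quoted above.
for second in (0.6, 0.8):
    print(second, round(expected_tokens_per_pass([0.85, second]), 3))
```

This counts tokens per pass only. It ignores the cost of running the MTP head and any batching effects, so it is an upper bound on the speedup rather than a measurement.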

It works great. I'll keep my increased performance, and

> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers

you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.