by wolttam 1 day ago
https://developer.nvidia.com/blog/an-introduction-to-specula...

You draft n tokens, and you verify them in a single forward pass.
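To make the draft-then-verify step concrete, here is a minimal sketch of one verification pass. It assumes greedy (argmax) acceptance, which is a simplification: with sampled decoding, real implementations use a rejection-sampling rule instead. `toy_target` and `verify_draft` are hypothetical names for illustration, not vLLM APIs, and `toy_target` stands in for a single forward pass of the target model.

```python
from typing import Callable, List, Sequence


def toy_target(seq: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for ONE forward pass of the target model.

    For every position i it returns the target's greedy pick for the token
    that follows seq[: i + 1]. Toy rule: next token is (last + 1) mod 10.
    """
    return [(tok + 1) % 10 for tok in seq]


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_argmax: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Return the tokens emitted by one verification pass (greedy acceptance)."""
    if not context:
        raise ValueError("context must contain at least one token")

    # One pass over context + draft scores every draft position at once.
    preds = target_argmax(list(context) + list(draft))
    start = len(context) - 1  # preds[start] is the target's pick for draft[0]

    emitted: List[int] = []
    for i, tok in enumerate(draft):
        target_tok = preds[start + i]
        if tok != target_tok:
            # First mismatch: emit the target's own token and stop.
            emitted.append(target_tok)
            return emitted
        emitted.append(tok)

    # Every draft token matched, so the same pass yields one extra token.
    emitted.append(preds[start + len(draft)])
    return emitted
```

Under this sketch a pass always emits at least one token, and up to n + 1 when the whole draft is accepted, which is where the speedup comes from.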

Here's the vLLM flag:

    --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

They may have only trained at a depth of 1, but boy-howdy, does that little MTP head do a pretty good job of successfully predicting that second token, about 60-80% of the time.
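For a rough sense of what that acceptance rate buys, here is a back-of-the-envelope sketch. It assumes a draft token only counts if every draft token before it was accepted, and it ignores the cost of running the MTP head, so it gives tokens per target forward pass, not wall-clock speedup. `expected_tokens_per_pass` is a hypothetical helper, and the first-token acceptance rate is not stated above, so any value plugged in for it is an assumption.

```python
from typing import Sequence


def expected_tokens_per_pass(accept_probs: Sequence[float]) -> float:
    """Expected tokens emitted per target forward pass.

    accept_probs[k] is the (assumed) chance that draft token k is accepted,
    given that all earlier draft tokens were accepted.
    """
    expected = 1.0  # the target model always contributes one token per pass
    running = 1.0   # chance that every draft token so far was accepted
    for p in accept_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("acceptance probabilities must be in [0, 1]")
        running *= p
        expected += running
    return expected
```

With an assumed first-token rate of 0.9 and the 60-80% range for the second token, this works out to roughly 2.44 to 2.62 tokens per pass with `num_speculative_tokens` set to 2.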

It works great. I'll keep my increased performance, and

> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers

you keep whatever this is. I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.