by wolttam
1 day ago
https://developer.nvidia.com/blog/an-introduction-to-specula...

You draft n tokens, and you verify them in a single forward pass. Here's the vLLM flag: --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
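
For anyone following along, here is a toy sketch of that draft-then-verify loop with greedy decoding. It is not vLLM's implementation: `draft_next`, `target_next`, and `speculative_step` are hypothetical stand-ins I made up for illustration, and the "models" are just small functions over integers. A real engine gets all the verification predictions from one batched forward pass of the big model; the list comprehension below only imitates that.

```python
from typing import Callable, List, Sequence

Token = int
NextFn = Callable[[Sequence[Token]], Token]


def target_next(ctx: Sequence[Token]) -> Token:
    """Hypothetical 'big model': always counts up modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: Sequence[Token]) -> Token:
    """Hypothetical cheap draft head: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(ctx: Sequence[Token], draft: NextFn, target: NextFn, n: int) -> List[Token]:
    # 1. Draft n tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(n):
        drafted.append(draft(list(ctx) + drafted))

    # 2. Verify: the target's greedy choice at each of the n + 1 positions.
    #    (One batched forward pass in a real engine; a plain loop here.)
    verified = [target(list(ctx) + drafted[:i]) for i in range(n + 1)]

    # 3. Keep the longest drafted prefix the target agrees with, then append
    #    the target's own token (a correction, or a bonus token if all matched).
    out: List[Token] = []
    for i in range(n):
        if drafted[i] != verified[i]:
            break
        out.append(drafted[i])
    out.append(verified[len(out)])
    return out


def generate(prompt: Sequence[Token], steps: int, n: int) -> List[Token]:
    """Run several speculative steps; each step emits between 1 and n + 1 tokens."""
    seq = list(prompt)
    for _ in range(steps):
        seq += speculative_step(seq, draft_next, target_next, n)
    return seq
```

The point of the sketch: every step emits at least one token the target itself chose, so with greedy decoding the output matches what the target would have produced alone. The draft only changes how many target passes it takes to get there.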
They may have only trained at a depth of 1, but boy-howdy, that little MTP head does a pretty good job of successfully predicting that second token, about 60-80% of the time.

It works great. I'll keep my increased performance, and you keep whatever this is.

> so i don't know why you are punching these documents into the chatbot, and asking it questions about them, and then it gives you the wrong answers

I posted direct quotes from their papers which say "it speeds up inference" (paraphrasing). I don't feel there is anything I can do to turn this into a good-faith discussion. Beep boop.