| HN Mirror

	by lreeves 58 days ago
	Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate, in Qwen36 at least, is around 60-70%, and the "wrong" tokens are still filled in entirely by the base model. But when you just accept 100% of the draft tokens, it seems kind of self-defeating, unless I'm wrong. Also, I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure, but I'd love to know the time-to-response when asking for an analysis of an image or a few hundred lines of code.

3 comments

dvdkon 58 days ago

As far as I know, speculative decoding still verifies that the proposed tokens are what the "big" model would generate, it just uses the guesses to make that process faster. Setting the probability threshold too low then shouldn't affect correctness, just speed (time will be wasted verifying bad guesses).
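To make that concrete, here is a minimal toy sketch of greedy speculative decoding. Everything in it is an assumption for illustration: `target_next` and `draft_next` are made-up stand-ins for the big and small models, and `p_min` is only loosely modeled on a draft-confidence threshold such as `--draft-p-min` (it stops drafting early; it does not decide acceptance). Real implementations verify all drafted positions in one batched forward pass, and sampling-based variants use a probabilistic accept/reject rule rather than exact match.

```python
from typing import Dict, List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def target_next(ctx: List[int]) -> int:
    """Toy 'big' model: greedy next token as a deterministic function of context."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> Tuple[int, float]:
    """Toy 'small' model: returns (guess, its own confidence).

    It agrees with the target except at every third position, where it
    guesses wrong and reports low confidence.
    """
    true_tok = target_next(ctx)
    if len(ctx) % 3 == 0:
        return (true_tok + 1) % VOCAB, 0.4
    return true_tok, 0.95


def plain_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: generate n tokens with the target model only."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(
    prompt: List[int], n: int, max_draft: int = 4, p_min: float = 0.0
) -> Tuple[List[int], Dict[str, int]]:
    """Greedy speculative decoding; returns (tokens, counters)."""
    out = list(prompt)
    stats = {"drafted": 0, "accepted": 0, "target_passes": 0}
    while len(out) - len(prompt) < n:
        # 1) Draft: propose up to max_draft tokens, stopping early when the
        #    draft model's own confidence drops below p_min.
        draft: List[int] = []
        ctx = list(out)
        for _ in range(max_draft):
            tok, conf = draft_next(ctx)
            if conf < p_min:
                break
            draft.append(tok)
            ctx.append(tok)
        stats["drafted"] += len(draft)

        # 2) Verify: every drafted token is checked against what the target
        #    would have produced; the first mismatch discards the rest.
        stats["target_passes"] += 1
        for tok in draft:
            if tok != target_next(out):
                break
            out.append(tok)
            stats["accepted"] += 1

        # 3) The target always contributes one token of its own: the
        #    correction after a mismatch, or a bonus token otherwise.
        out.append(target_next(out))
    return out[len(prompt):len(prompt) + n], stats


if __name__ == "__main__":
    baseline = plain_decode([7], 20)
    for threshold in (0.0, 0.9):
        tokens, stats = speculative_decode([7], 20, p_min=threshold)
        print(threshold, tokens == baseline, stats)
```

In this sketch the output matches the target-only baseline for any `p_min`, because a token is only ever appended if the target would have produced it. Lowering the threshold just means more drafted tokens get rejected at verification, which is wasted work rather than a correctness problem.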


lreeves 58 days ago

But won't setting it to accept 100% of the proposed tokens skip the verification?


ac29 58 days ago

None of those settings set the speculative decoder to accept 100% of drafted tokens. I assume you are looking at --draft-p-min 0.0; if so, you are misunderstanding what it does.


naasking 58 days ago

It depends on the type of MTP. If you're using two models, draft + full, then arguably yes, the larger model isn't providing much benefit if you really are seeing 100% acceptance rates. There are other forms of speculative decoding that work within the larger model by itself, though; e.g., Qwen has additional speculative decoding attention heads, so there is no secondary drafting model.
