|
|
|
|
|
by lreeves
11 days ago
|
|
Doesn't accepting 100% of the MTP draft tokens mean you should just be using the smaller model? Usually the acceptance rate in Qwen36 at least is around 60-70% and the "wrong" tokens are still filled in entirely by the base model, but when you just accept 100% of the draft tokens it seems kind of self defeating unless I'm wrong. Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code. |
|