by 0-_-0 29 days ago
3.6 already supports multi-token generation, AFAIK.

Yes, but it's not diffusion-based; it's still doing token-at-a-time speculation.
I thought it could do multiple tokens at a time.
Think of this as another way of achieving that. In theory, it has a higher ceiling on how many tokens it can predict at once and, more importantly, it is a lot more memory-efficient during actual inference.
There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead, depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in Qwen's architecture.
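
For anyone wondering what "provably identical distribution" usually means in this context: here is a minimal sketch of the standard speculative-sampling accept/reject rule, computed exactly rather than by sampling. This is an assumption about what the claim refers to; nothing in this thread confirms that the method under discussion or Qwen's MTP head uses exactly this rule. The function name and the toy probabilities are made up for illustration.

```python
def speculative_output_distribution(p_target, q_draft):
    """Exact distribution of one emitted token under the standard
    speculative-sampling rule (hypothetical helper, illustration only).

    p_target: target-model probabilities over the vocabulary
    q_draft:  draft-model probabilities over the same vocabulary
    """
    # A draft token x is proposed with probability q[x] and accepted with
    # probability min(1, p[x] / q[x]), so P(propose and accept x) = min(p[x], q[x]).
    accepted = [min(p, q) for p, q in zip(p_target, q_draft)]

    # Total probability that the proposed token is rejected.
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        # Draft and target already agree, so nothing is ever rejected.
        return accepted

    return [a + reject_mass * r / residual_total
            for a, r in zip(accepted, residual)]


# Toy example: the draft is a poor match, yet the output matches the target.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.6, 0.2]
out = speculative_output_distribution(p, q)
print(all(abs(o - t) < 1e-12 for o, t in zip(out, p)))
```

The point is that draft quality only changes how often tokens are accepted (the speed), not what gets emitted. Whether a given MTP setup keeps that guarantee depends on whether it verifies drafts with a rule like this one.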