by zozbot234 56 days ago
Recent models support multi-token prediction, which guesses several future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verifies them all at once. It's still an emerging feature (not widely supported), and it only speeds up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
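To make the guess-then-verify idea concrete, here is a minimal sketch of one decode step. It assumes greedy decoding and exact-match acceptance; the names `mtp_decode_step`, `target_next`, and `toy_target` are hypothetical and don't come from any particular inference engine. A real implementation would verify all guesses in one batched forward pass rather than a Python loop, and may use a probabilistic acceptance rule instead.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical stand-in for the full model: given a token sequence,
# return its greedy choice for the next token.
NextTokenFn = Callable[[Sequence[Token]], Token]


def mtp_decode_step(
    context: Sequence[Token],
    draft_tokens: Sequence[Token],
    target_next: NextTokenFn,
) -> List[Token]:
    """Check drafted tokens against the full model; return the tokens kept.

    Always returns at least one token, so a step with bad guesses is
    no worse (in tokens emitted) than ordinary one-token decoding.
    """
    accepted: List[Token] = []
    for guess in draft_tokens:
        # In a real engine these per-position checks come from a single
        # batched pass; they are looped here only for readability.
        expected = target_next(list(context) + accepted)
        if guess != expected:
            # First wrong guess: keep the model's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(guess)
    # Every guess matched, so the verification also yields one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def toy_target(seq: Sequence[Token]) -> Token:
    """Toy 'model' that just counts upward modulo 10 (fully predictable)."""
    return (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(mtp_decode_step([1, 2, 3], [4, 5, 6], toy_target))  # all guesses right
    print(mtp_decode_step([1, 2, 3], [4, 9, 6], toy_target))  # second guess wrong
```

With the toy model, three correct guesses yield four tokens from one step, while a wrong second guess yields two. That is why the speedup depends so heavily on how predictable the token run is.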

It seems to me it's only Grok 4.20 that does this currently? Which other models did you have in mind, if I may ask?
Gemma 4, Qwen3.6, DeepSeek V4, MiMo, and GLM 5/5.1 all do MTP.
Thank you, I just realised we are talking about MTP. It doesn't seem that clear-cut for Gemma 4, though: "Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"
If Mistral Medium 3.5 supports it, that might get it to 10 t/s. It will still be fairly slow.