| HN Mirror



	by zozbot234 56 days ago
	Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's still an emerging feature (not widely supported), and it only helps speed up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
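
To make the guess-then-verify idea concrete, here is a minimal toy sketch of the acceptance step under greedy decoding. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for the full model, `verify_draft` is an invented helper name, and no real inference library works exactly like this. A real implementation would score all drafted positions in one batched forward pass; the loop below calls the model once per position only to keep the logic readable.

```python
from typing import Callable, List


def toy_model(context: List[int]) -> int:
    # Hypothetical stand-in for the full model's greedy next-token choice:
    # the "correct" next token is always (last token + 1) mod 10.
    return (context[-1] + 1) % 10


def verify_draft(
    context: List[int],
    draft: List[int],
    next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one decode step.

    Keeps the longest prefix of `draft` that matches the model's own
    greedy choices, then adds one token from the model itself: either
    the correction at the first mismatch, or a bonus token if every
    drafted token was accepted.
    """
    accepted: List[int] = []
    for guess in draft:
        target = next_token(context + accepted)
        if guess != target:
            # First wrong guess: emit the model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(guess)
    # All guesses were right: the model's next token comes for free.
    accepted.append(next_token(context + accepted))
    return accepted


if __name__ == "__main__":
    context = [1]
    print(verify_draft(context, [2, 3, 9], toy_model))  # two guesses accepted, third corrected
    print(verify_draft(context, [2, 3, 4], toy_model))  # all guesses accepted, plus a bonus token
    print(verify_draft(context, [7], toy_model))        # guess rejected, one token as usual
```

In this sketch a step never emits fewer than one token, so a bad draft costs nothing in output quality, and a good draft emits several tokens for one verification. That is why the gain depends on how predictable the upcoming tokens are.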

2 comments

pbgcp2026 55 days ago

It seems to me it's only Grok 4.20 that does this currently? Which other models did you have in mind, if I may ask?


phamilton 55 days ago

Gemma 4, Qwen3.6, DeepSeek V4, MiMo, GLM 5/5.1 all do MTP.


pbgcp2026 55 days ago

Thank you, I just realised we are talking about MTP. It seems that it's not that clear though. "Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"


parsimo2010 55 days ago

If Mistral Medium 3.5 supports it, that might get it to 10 t/s. It will still be fairly slow.
