| HN Mirror



	by stingraycharles 281 days ago
	Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea? I’m not an expert on LLMs, just a user.

4 comments

tomp 281 days ago

No, the parent is wrong.

Checking a token is the same as generating it.

The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) guess from the previous turn, you have just generated 3 correct tokens (and another speculative one). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
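A minimal sketch of the turn-by-turn scheme described above, under loud assumptions: `target_next` is a made-up deterministic rule standing in for the main model, `draft` stands in for the MTP head, and one loop iteration stands in for one forward pass (on real hardware the two `target_next` calls inside an iteration would share a single batched pass). None of these names come from an actual library.

```python
from typing import Callable, List, Optional, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the main model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 10


def greedy_decode(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    """Baseline: one forward pass per generated token."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):], n_new


def speculative_decode(
    prompt: List[Token],
    n_new: int,
    draft: Callable[[List[Token]], Token],
) -> Tuple[List[Token], int]:
    """Each pass verifies the previous guess and, if it holds, keeps a bonus token."""
    ctx = list(prompt)
    passes = 0
    guess: Optional[Token] = None  # speculative token from the previous pass
    while len(ctx) - len(prompt) < n_new:
        passes += 1
        real = target_next(ctx)  # the "real" prediction for the guessed slot
        if guess is not None and guess == real:
            # Guess confirmed: the token computed on top of it is valid too.
            follow = target_next(ctx + [guess])
            ctx += [real, follow]
        else:
            # Guess wrong (or first pass): keep only the corrected token.
            ctx.append(real)
        guess = draft(ctx)  # new speculative token for the next pass
    return ctx[len(prompt):len(prompt) + n_new], passes


def perfect_draft(ctx: List[Token]) -> Token:
    """Toy draft that always agrees with the main model."""
    return target_next(ctx)


def wrong_draft(ctx: List[Token]) -> Token:
    """Toy draft that never agrees with the main model."""
    return (target_next(ctx) + 1) % 10


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print("greedy     ", greedy_decode(prompt, 10))
    print("perfect MTP", speculative_decode(prompt, 10, perfect_draft))
    print("wrong MTP  ", speculative_decode(prompt, 10, wrong_draft))
```

In this toy the output is identical to plain greedy decoding whatever the draft does; only the number of passes changes, which is the property the comment describes.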


bigwheels 280 days ago

Thanks for the clarification. Your comment made me connect the similarity (in spirit) of Speculative Decoding to Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO.

[1] https://en.wikipedia.org/wiki/Speculative_execution

Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.


jychang 278 days ago

To clarify, I should have stated: "Instead of generating tokens one at a time, you generate the second one as well WITH MTP, and then use speculative decoding on that second token (instead of having the second token be produced by a draft model like Qwen 0.6b). If the FIRST MTP token is checked and is correct, then the second token gets generated MUCH faster."


bdcs 281 days ago

It relies on an “unintuitive observation”[0] that you can run batches basically for free (up to a limit). So if you only run one inference, you batch it plus a lot of guesses and, if you guess right, can speed up the inference by the number of guesses. If you guess wrong, you're back to regular speed (and still fully correct).

[0] https://x.com/karpathy/status/1697318534555336961
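As a rough back-of-the-envelope for the claim above, here is the expected number of tokens kept per forward pass when `k` guesses are batched alongside the real token. This sketch assumes each guess is accepted independently with the same probability `p`, which is a simplification; the function name and parameters are illustrative only. A guess only counts if every guess before it was also right, so the expectation is 1 + p + p^2 + ... + p^k.

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens kept per pass with k batched guesses.

    Assumes (hypothetically) that each guess is accepted independently
    with probability p. The i = 0 term is the real token, which is
    always kept; guess i contributes only if all i guesses so far held.
    """
    return sum(p ** i for i in range(k + 1))


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.8, 1.0):
        for k in (1, 2, 4):
            print(f"p={p:.1f} k={k}: {expected_tokens_per_pass(p, k):.3f}")
```

With `p = 0` this gives exactly one token per pass (regular speed), and with `p = 1` it gives `k + 1`, matching the two extremes in the comment.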


namibj 281 days ago

Basically you can generate the next two tokens at once in the same matmul, and roll back to one-at-a-time when your generation says you guessed wrong (as that means the second token of the pair was generated from revoked context).


Zacharias030 280 days ago

yes, if you know the sequence of tokens ahead of time you can verify them about as quickly as you can generate one more token because of the parallelism benefits.

If you don’t know the future tokens though, then you can’t, and blind guessing of tokens is infeasible because the vocabulary contains circa 100k possible different tokens.
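To put a number on "infeasible": a tiny sketch, assuming guesses are drawn uniformly at random from a vocabulary of 100,000 tokens (the round figure from the comment; real vocabularies and guessing strategies differ). The function name is illustrative.

```python
def blind_hit_probability(vocab_size: int, k: int) -> float:
    """Chance that k uniformly random guesses all match the true tokens."""
    return (1.0 / vocab_size) ** k


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, blind_hit_probability(100_000, k))
```

Under that assumption a single blind guess hits about once in 100,000 passes, so the extra batched work is almost never rewarded.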
