by furyofantares 86 days ago

Here is the line in vLLM's source code that determines if a draft token is accepted:

    accepted = draft_prob > 0 and target_prob / draft_prob >= uniform_prob

There is also a branch that checks only token ID equality, which is used when the temperature is 0.
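To make the two branches concrete, here is a minimal sketch in plain Python. It is not vLLM's code: the function names `accept_draft_token`, `accept_greedy`, and `acceptance_rate` are made up for illustration, and only the boolean expression is taken from the line quoted above.

```python
import random


def accept_draft_token(draft_prob, target_prob, uniform_prob):
    """Sampling branch: same expression as the quoted line.

    uniform_prob is assumed to be a fresh draw from Uniform[0, 1).
    """
    return draft_prob > 0 and target_prob / draft_prob >= uniform_prob


def accept_greedy(draft_token_id, target_token_id):
    """Temperature-0 branch (hypothetical helper): plain token ID equality."""
    return draft_token_id == target_token_id


def acceptance_rate(draft_prob, target_prob, trials=100_000, seed=0):
    """Estimate how often a token with these two probabilities is accepted."""
    rng = random.Random(seed)
    hits = sum(
        accept_draft_token(draft_prob, target_prob, rng.random())
        for _ in range(trials)
    )
    return hits / trials
```

When the target model likes the token at least as much as the draft model did, the ratio is >= 1 and the token is always kept. Otherwise it is kept with probability `target_prob / draft_prob`, so `acceptance_rate(0.8, 0.2)` should come out close to 0.25.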


Klaus23 86 days ago

Good analysis. That's surprising. I had always heard that the draft model doesn't affect the output in any way. It seems they do it this way to achieve faster generation. It would be interesting to investigate how this affects the output.

Edit: I haven't gone through all the code, but they might do something like this: https://arxiv.org/abs/2211.17192, where a draft model is used and the output distribution is adjusted on rejection, so that the result follows exactly the same distribution as the main model.


furyofantares 86 days ago

I have convinced myself that it is in fact the same distribution, even if you don't get the same output on any given run. Pretty cool.
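For anyone who wants to check this without trusting the argument, here is a small exact calculation for a single token position. It assumes the scheme from the paper linked above: accept a drafted token with probability min(1, p/q), and on rejection resample from max(p - q, 0) renormalized. Whether vLLM's rejection path does exactly this is not something established in this thread. The function name `speculative_output_distribution` and the example distributions are made up for illustration.

```python
from fractions import Fraction


def speculative_output_distribution(target, draft):
    """Exact distribution of the emitted token for one position.

    target[i] and draft[i] are the probabilities p(i) and q(i) of token i.
    Assumed scheme: draw i from q, keep it with probability min(1, p/q),
    otherwise resample from max(p - q, 0) renormalized.
    """
    # Probability of drawing token i from the draft AND accepting it: min(p, q).
    accepted = [
        q * min(Fraction(1), p / q) if q > 0 else Fraction(0)
        for p, q in zip(target, draft)
    ]
    reject_prob = 1 - sum(accepted)

    # Residual distribution used after a rejection (unnormalized).
    residual = [max(p - q, Fraction(0)) for p, q in zip(target, draft)]
    total = sum(residual)
    if total == 0:  # draft equals target, so nothing is ever rejected
        return accepted

    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


target = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
draft = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]
result = speculative_output_distribution(target, draft)
```

Here `result` equals `target` exactly, even though the draft distribution is quite different. Individual runs can still differ from what sampling the main model directly would have produced with the same random seed, because the random numbers are consumed differently.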
