by janalsncm
38 days ago
The small draft model proposes a sequence of tokens d1 d2 d3. The big target model calculates P(d1), P(d2|d1), P(d3|d1 d2) in parallel, in a single forward pass.

If we were just greedy decoding it would be simple: stop at the first position where the draft model doesn't predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off the next round of drafting and verification.

In practice we aren't using greedy decoding. We are sampling, and we need to match the target model's distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target model at each position. A drafted token is accepted with probability min(1, p_target / p_draft), the ratio of the two softmax probabilities. On the first rejection, a replacement token is sampled from the normalized positive part of (p_target - p_draft), which is what keeps the output distribution identical to the target's.

You are right that actually accepting tokens has to happen sequentially, but that's a heck of a lot faster than a forward pass.
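To make the accept/reject loop concrete, here is a minimal sketch in plain Python. It assumes the softmax has already been applied, so each entry of `draft_probs` and `target_probs` is a full probability distribution over the vocabulary for one position, and that `target_probs` carries one extra distribution for the position after the last drafted token. The function name and argument layout are made up for illustration and are not taken from any particular library.

```python
import random


def speculative_accept(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens kept from one draft-and-verify round.

    draft_tokens : token ids proposed by the draft model
    draft_probs  : draft_probs[i] is the draft distribution at position i
    target_probs : target_probs[i] is the target distribution at position i,
                   with one extra entry for the position after the draft
    """
    kept = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]  # target probability of the drafted token
        q = draft_probs[i][tok]   # draft probability (> 0, since it was sampled)

        # Accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            kept.append(tok)
            continue

        # First rejection: resample from the positive part of (p - q).
        # random.choices normalizes the weights for us.
        residual = [max(0.0, pt - qd)
                    for pt, qd in zip(target_probs[i], draft_probs[i])]
        kept.append(rng.choices(range(len(residual)), weights=residual)[0])
        return kept  # everything after the rejected position is discarded

    # Every drafted token was accepted, so the target's extra distribution
    # gives us one more token for free.
    bonus = target_probs[len(draft_tokens)]
    kept.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return kept
```

The loop is sequential, but each step is a couple of lookups and one random draw, which is why it is negligible next to the target model's forward pass. A real implementation would do this on batched tensors rather than Python lists.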