by joliu
315 days ago
It does run inference, but over the whole batch of drafted tokens at once, much like the prefill phase. The draft model decodes N new tokens, then the real model does a single forward pass to score all N of them. Prefill is compute-bound whereas decode is memory-bandwidth-bound, so in practice one prefill-style pass over N tokens is cheaper than N separate decode passes.
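
To make the draft-then-verify loop concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: `target_next` and `draft_next` are stand-in functions rather than real models, `verify_pass` only emulates what would be one batched forward pass in a real transformer, and acceptance is plain greedy matching rather than the probabilistic acceptance rule that sampling-based implementations use.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    # Hypothetical "real" model: a deterministic toy function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    # Hypothetical draft model: agrees with the target except when the
    # context length is a multiple of 4, where it is deliberately wrong.
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50


def verify_pass(target: NextToken, ctx: List[Token],
                drafted: List[Token]) -> List[Token]:
    # Stands in for ONE target forward pass over all drafted positions.
    # A real transformer would produce these N+1 predictions in a single
    # batched pass; the Python loop here only emulates that result.
    return [target(ctx + drafted[:i]) for i in range(len(drafted) + 1)]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int,
                       n_draft: int = 4) -> Tuple[List[Token], int]:
    out = list(prompt)
    passes = 0  # number of target verification passes
    while len(out) - len(prompt) < max_new:
        # 1. The draft model decodes n_draft tokens one at a time.
        drafted: List[Token] = []
        for _ in range(n_draft):
            drafted.append(draft(out + drafted))
        # 2. The target scores all drafted tokens in one pass.
        preds = verify_pass(target, out, drafted)
        passes += 1
        # 3. Keep the longest prefix where draft and target agree.
        accepted = 0
        while accepted < len(drafted) and drafted[accepted] == preds[accepted]:
            accepted += 1
        out.extend(drafted[:accepted])
        # 4. Take the target's own token at the first mismatch
        #    (or its extra token if every drafted token was accepted).
        out.append(preds[accepted])
    return out[:len(prompt) + max_new], passes


def greedy_decode(target: NextToken, prompt: List[Token],
                  max_new: int) -> List[Token]:
    # Baseline: one target pass per generated token.
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_next, draft_next, [1, 2, 3], 12)
    print(tokens == greedy_decode(target_next, [1, 2, 3], 12), passes)
```

The output matches plain greedy decoding from the target, but the target runs fewer passes than it emits tokens. Whether that translates into a wall-clock win depends on the point above: each verification pass has to cost about the same as a single decode step.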