
by joliu 315 days ago

It does run inference, but on the batch of tokens that were drafted, akin to the prefill phase.

So your draft model can decode N new tokens, then the real model does one inference pass to score the N new drafted tokens.
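To make the draft-then-verify loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `target_next` and `draft_next` are hypothetical toy stand-ins for the real and draft models, decoding is greedy, and the "one inference pass" is simulated by a list comprehension (a real transformer would produce all of those predictions in a single batched forward pass).

```python
from typing import List, Tuple

def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model: greedy next token."""
    return (sum(prefix) + 1) % 7

def draft_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small model: usually agrees, sometimes not."""
    return 0 if len(prefix) % 4 == 0 else (sum(prefix) + 1) % 7

def speculative_step(prefix: List[int], n_draft: int) -> List[int]:
    # 1. The draft model decodes n_draft tokens one at a time.
    drafted: List[int] = []
    for _ in range(n_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2. The target model scores every drafted position. A real model would
    #    do this in ONE forward pass over the drafted batch; here it is
    #    simulated with one toy call per position.
    target_preds = [target_next(prefix + drafted[:i]) for i in range(n_draft + 1)]

    # 3. Accept drafted tokens until the first disagreement, then append the
    #    target's own token at that position (so each step yields >= 1 token).
    accepted: List[int] = []
    for i, tok in enumerate(drafted):
        if tok != target_preds[i]:
            break
        accepted.append(tok)
    accepted.append(target_preds[len(accepted)])
    return accepted

def generate_speculative(prompt: List[int], n_new: int, n_draft: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, n_draft))
        target_passes += 1
    return out[: len(prompt) + n_new], target_passes

def generate_greedy(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out
```

Under these assumptions, `generate_speculative` returns the same tokens as `generate_greedy`, while the number of target passes is at most the number of generated tokens and is lower whenever drafted tokens are accepted.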

Prefill is compute-bound, whereas decode is memory-bandwidth-bound, so in practice one prefill-style pass over N tokens is cheaper than N separate decode passes.
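A rough back-of-the-envelope version of that cost argument, as a sketch only. The hardware and model numbers below are made-up assumptions (7B parameters, 2 bytes per parameter, 1 TB/s memory bandwidth, 100 TFLOP/s peak compute), and the model ignores attention and KV-cache traffic. It treats a forward pass as taking whichever is longer: reading all the weights once, or doing about 2 FLOPs per parameter per token.

```python
def pass_time(n_tokens: int,
              params: float = 7e9,          # assumed model size
              bytes_per_param: float = 2,   # assumed 16-bit weights
              mem_bw: float = 1e12,         # assumed bytes/s
              peak_flops: float = 1e14) -> float:  # assumed FLOP/s
    """Estimated seconds for one forward pass over n_tokens (roofline-style)."""
    memory_time = params * bytes_per_param / mem_bw    # weights read once per pass
    compute_time = 2 * params * n_tokens / peak_flops  # ~2 FLOPs per param per token
    return max(memory_time, compute_time)

def sequential_decode_time(n_tokens: int) -> float:
    """Estimated seconds for n_tokens separate single-token decode passes."""
    return n_tokens * pass_time(1)

if __name__ == "__main__":
    n = 8
    print(f"one pass over {n} tokens: {pass_time(n):.4f} s")
    print(f"{n} decode passes:        {sequential_decode_time(n):.4f} s")
```

With these assumed numbers, a pass over 8 tokens is still limited by reading the weights, so it costs about the same as a single decode pass, and 8 sequential decode passes cost roughly 8 times as much. The compute term only starts to dominate at around 100 tokens per pass in this toy model.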