AFAIU, it's not that checking against the large model is quick (in the usual P != NP sense that checking an answer is easier than finding one). It's that you can batch your checks. You speculate the next 5 tokens, and then the large model runs once, in parallel, over the batch of prefixes [...,n+1], [...,n+2], [...,n+3], [...,n+4], [...,n+5]. If you guessed right for a prefix, you turned a sequential problem (computing the next token from the current prefix) into a parallel one (evaluating multiple prefixes together), which is what the GPU likes. If you guessed wrong, you have to throw away the suffix starting at the first wrong guess, and you wasted some extra energy computing it.
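To make the batching idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `target_model` and `draft_model` are toy deterministic functions standing in for the large model and whatever produces the guesses, and the "batch" is a plain list comprehension where a real system would do a single batched GPU pass. It also assumes greedy decoding, and it keeps the large model's own token at the first mismatch, since that token was already computed during the check.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: next token is the prefix sum mod 7.
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    # Hypothetical cheap guesser: agrees with the target model except
    # when the prefix length is a multiple of 4, where it is off by one.
    guess = sum(prefix) % 7
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int = 5) -> List[int]:
    # 1. Speculate k tokens one after another with the cheap guesser.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(prefix + guesses))

    # 2. Build the k prefixes to check. These are independent of each other,
    #    so a real implementation would evaluate them in one batched pass;
    #    here the "batch" is simulated with a list comprehension.
    batch = [prefix + guesses[:i] for i in range(k)]
    verified = [target(p) for p in batch]

    # 3. Keep guesses up to the first disagreement, discard the rest.
    accepted: List[int] = []
    for guess, truth in zip(guesses, verified):
        if guess != truth:
            accepted.append(truth)  # target's token here was already computed
            break
        accepted.append(guess)
    return accepted


def generate(prefix: List[int], n_tokens: int, draft: Model, target: Model, k: int = 5) -> List[int]:
    # Repeat speculative steps until n_tokens new tokens exist, then trim.
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


def generate_sequential(prefix: List[int], n_tokens: int, target: Model) -> List[int]:
    # Baseline: one target-model call per token, strictly in sequence.
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

With these toy models, `speculative_step([1, 2, 3], draft_model, target_model)` accepts the first guess and replaces the second, returning `[6, 5]`, and `generate` produces the same tokens as `generate_sequential`. The output never changes; only the number of sequential large-model passes does.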