
As far as I understand, it works like this:

1. The small model generates k tokens (probably k >= 4 or even higher; there's a tradeoff to be made here, depending on the model sizes).

2. The big model computes its own logits (probabilities) for all k tokens in parallel.

3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the threshold, the first unsuitable token and everything after it are discarded.

4. Return to step 1, maybe with an adjusted k.
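For what it's worth, here is a rough Python sketch of one round of the steps above. Everything in it is made up for illustration: `toy_small`, `toy_big`, `speculative_step` and the `threshold` value are hypothetical, and the "models" are just lookup functions that return a next-token distribution. It follows the simple threshold rule as I described it; as far as I know, published speculative sampling schemes instead accept a drafted token with probability min(1, p_big / p_small). The fallback where the big model supplies its own token after a rejection is also my assumption, added so that every round makes progress.

```python
from typing import Callable, Dict, List

Token = str
# Hypothetical model interface: context in, next-token distribution out.
Model = Callable[[List[Token]], Dict[Token, float]]

PHRASE = ["Alright", "the", "user", "wants", "me", "to"]


def toy_small(context: List[Token]) -> Dict[Token, float]:
    """Toy draft model: always predicts the next word of PHRASE by position."""
    return {PHRASE[len(context) % len(PHRASE)]: 1.0}


def toy_big(context: List[Token]) -> Dict[Token, float]:
    """Toy target model: agrees with the draft except at position 3."""
    pos = len(context) % len(PHRASE)
    if pos == 3:
        return {"asks": 0.9, "wants": 0.1}
    return {PHRASE[pos]: 0.95, "other": 0.05}


def draft_tokens(small: Model, context: List[Token], k: int) -> List[Token]:
    """Step 1: the small model greedily proposes k tokens, one at a time."""
    proposed: List[Token] = []
    for _ in range(k):
        dist = small(context + proposed)
        proposed.append(max(dist, key=dist.get))
    return proposed


def verify(big: Model, context: List[Token], proposed: List[Token],
           threshold: float) -> List[Token]:
    """Steps 2-3: keep the longest prefix the big model finds likely enough.

    A real transformer scores all k positions in one batched forward pass;
    this loop only imitates that result sequentially.
    """
    accepted: List[Token] = []
    for i, tok in enumerate(proposed):
        dist = big(context + proposed[:i])
        if dist.get(tok, 0.0) < threshold:
            break  # discard this token and everything after it
        accepted.append(tok)
    return accepted


def speculative_step(small: Model, big: Model, context: List[Token],
                     k: int, threshold: float) -> List[Token]:
    """One round of draft + verify; returns the tokens to append."""
    proposed = draft_tokens(small, context, k)
    accepted = verify(big, context, proposed, threshold)
    if len(accepted) < len(proposed):
        # Assumption (not in the description above): on a rejection, the big
        # model's own top token fills the rejected slot.
        dist = big(context + accepted)
        accepted.append(max(dist, key=dist.get))
    return accepted


# Step 4 would call speculative_step again with the extended context.
result = speculative_step(toy_small, toy_big, [], k=4, threshold=0.3)
# result == ["Alright", "the", "user", "asks"]
```

With k=4 the draft proposes "Alright the user wants"; the toy big model only gives "wants" a probability of 0.1, so that token is dropped and replaced by "asks". With k=3 all three drafted tokens pass and are returned as-is.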