How exactly does this give a speedup? If you have to wait for the large model to confirm the small model's predictions, wouldn't it always be slower than just running the large model?
Apparently, for the bigger model, checking a token is faster than generating a fresh one. So if the tiny model gets it right, you get a small speed bump. Can't say I fully understand why it's faster to check, either.
It needs a pretty large difference in size to result in a speedup. 0.5B vs. 27B is the only pairing where I've seen a speed bump.
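To put rough numbers on why the size gap matters, here is a back-of-the-envelope sketch, not a measurement. Everything in it is an assumption for illustration: `alpha` is the chance that any single drafted token is accepted (treated as independent per token), `c` is the cost of one small-model step relative to one big-model pass, checking k tokens is assumed to cost about the same as one normal big-model pass, and each round is assumed to yield one extra token from the big model itself.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft/verify round.

    Assumes each drafted token is accepted independently with probability
    alpha, and that the big model contributes one token of its own per round.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the big model's token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Tokens per round divided by the cost of a round.

    Cost is measured in big-model passes: k draft steps at relative cost c
    each, plus one verification pass (assumed to cost one normal pass).
    """
    return expected_tokens_per_round(alpha, k) / (k * c + 1.0)


# Hypothetical cost ratios: a very small draft model, a half-size one,
# and one that is almost as expensive as the big model.
for c in (0.02, 0.5, 0.9):
    print(f"c={c}: ~{estimated_speedup(alpha=0.8, k=4, c=c):.2f}x")
```

Under these assumptions the estimate drops below 1x once the draft model gets close to the big model's cost, which would fit the observation that only very lopsided pairs pay off.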
1. The small model generates k tokens (probably k >= 4 or even higher; there's a tradeoff to be made here, depending on the model sizes).
2. The big model processes all k tokens' logits (probabilities) in parallel, in a single forward pass.
3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the probability threshold, then the first unsuitable token and all tokens after it are discarded.
4. Return to 1., maybe with an adjusted k.
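The steps above can be sketched as a toy loop. This is a minimal illustration under loud assumptions, not how any particular library does it: the two "models" are hypothetical stand-ins that map a token prefix to a next-token distribution, acceptance uses the simple probability threshold described above (real implementations typically use a sampling-based acceptance rule instead), and a rejected token is replaced by the big model's own top choice so that every round makes progress. The big model's scoring is written as a loop here; a real implementation would get all k distributions from one batched forward pass.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model type: token prefix -> probability of each next token.
Model = Callable[[List[str]], Dict[str, float]]

PHRASE = "Alright , the user wants me to sort a list".split()
DRAFT_PHRASE = "Alright , the user wants you to sort a list".split()


def target_model(prefix: List[str]) -> Dict[str, float]:
    """Toy big model: strongly prefers the next word of PHRASE."""
    if len(prefix) < len(PHRASE):
        return {PHRASE[len(prefix)]: 0.9, "<other>": 0.1}
    return {"<eos>": 1.0}


def draft_model(prefix: List[str]) -> Dict[str, float]:
    """Toy small model: same phrase, but with one word wrong."""
    if len(prefix) < len(DRAFT_PHRASE):
        return {DRAFT_PHRASE[len(prefix)]: 1.0}
    return {"<eos>": 1.0}


def top_token(dist: Dict[str, float]) -> str:
    return max(dist, key=dist.get)


def speculative_round(draft: Model, target: Model, prefix: List[str],
                      k: int, threshold: float) -> List[str]:
    # Step 1: the small model drafts k tokens, one after another.
    drafted: List[str] = []
    for _ in range(k):
        drafted.append(top_token(draft(prefix + drafted)))

    # Step 2: the big model scores every drafted position.
    # (Emulated with a loop; conceptually this is one parallel pass.)
    target_dists = [target(prefix + drafted[:i]) for i in range(k)]

    # Step 3: keep tokens until the first one below the threshold.
    accepted: List[str] = []
    for token, dist in zip(drafted, target_dists):
        if dist.get(token, 0.0) >= threshold:
            accepted.append(token)
        else:
            # Assumption: fall back to the big model's own choice here and
            # discard everything the small model drafted after this point.
            accepted.append(top_token(dist))
            break
    return accepted


def generate(draft: Model, target: Model, max_new: int,
             k: int = 4, threshold: float = 0.5) -> Tuple[List[str], int]:
    tokens: List[str] = []
    rounds = 0  # number of big-model verification passes
    while len(tokens) < max_new:
        # Step 4: go back to drafting (k is kept fixed in this sketch).
        tokens += speculative_round(draft, target, tokens, k, threshold)
        rounds += 1
    return tokens[:max_new], rounds


tokens, rounds = generate(draft_model, target_model, max_new=len(PHRASE))
print(" ".join(tokens))
print(f"big-model passes: {rounds} (plain decoding would need {len(PHRASE)})")
```

In this toy run the output matches what the big model would have produced on its own, but the big model is called once per round instead of once per token. That is where the speedup comes from, as long as the small model's drafting is cheap enough.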