by kmeisthax 417 days ago
How exactly does this give a speedup? If you have to wait for the large model to confirm the small model's predictions, wouldn't it always be slower than just running the large model?

As far as I understand, it works like this:

1. The small model generates k tokens (probably k >= 4 or even higher; there's a tradeoff to be made here, depending on the model sizes).

2. The big model computes the logits (and from them the probabilities) for all k token positions in parallel.

3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the probability threshold, then the first unsuitable token and everything after it are discarded.

4. Return to 1., maybe with an adjusted k.
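
To make the four steps concrete, here is a minimal Python sketch of one draft-and-verify round. Everything in it is an assumption for illustration: `draft_next` and `target_probs` are hypothetical stand-ins for the small and big model, the toy models only look at the context length, and the acceptance rule is the plain probability threshold described above (real implementations typically use a sampling-based acceptance test instead).

```python
from typing import Callable, Dict, List

Token = str
DraftFn = Callable[[List[Token]], Token]                # hypothetical small model: context -> next token
TargetFn = Callable[[List[Token]], Dict[Token, float]]  # hypothetical big model: context -> next-token probabilities


def speculative_step(context: List[Token], draft_next: DraftFn, target_probs: TargetFn,
                     k: int = 4, threshold: float = 0.3) -> List[Token]:
    """Run one round of steps 1-3 and return the tokens to append to the context."""
    # Step 1: the small model drafts k tokens, one after another.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # Step 2: the big model scores every drafted position (plus one extra).
    # A real implementation gets all of these from a single batched forward
    # pass; this list comprehension only imitates that.
    probs = [target_probs(context + drafted[:i]) for i in range(k + 1)]

    # Step 3: keep the longest prefix whose tokens all pass the threshold;
    # the first unsuitable token and everything after it are discarded.
    accepted: List[Token] = []
    for i, tok in enumerate(drafted):
        if probs[i].get(tok, 0.0) < threshold:
            break
        accepted.append(tok)

    # The big model's own pick for the next position comes for free from the
    # same pass, so every round yields at least one token.
    next_dist = probs[len(accepted)]
    accepted.append(max(next_dist, key=next_dist.get))
    return accepted


# Toy models (assumptions): the big model "wants" PHRASE, the small model
# agrees everywhere except at position 2.
PHRASE = ["Alright", ",", "the", "user", "wants", "me", "to"]


def toy_target_probs(ctx: List[Token]) -> Dict[Token, float]:
    if len(ctx) < len(PHRASE):
        return {PHRASE[len(ctx)]: 0.9, "<other>": 0.1}
    return {"<eos>": 1.0}


def toy_draft_next(ctx: List[Token]) -> Token:
    if len(ctx) == 2:
        return "a"  # deliberate miss
    return PHRASE[len(ctx)] if len(ctx) < len(PHRASE) else "<eos>"


if __name__ == "__main__":
    print(speculative_step([], toy_draft_next, toy_target_probs, k=4))
```

Step 4 would just call `speculative_step` in a loop, extending the context with the returned tokens each time and possibly changing `k` between rounds.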

Apparently, for the bigger model, checking a token is faster than generating a fresh one. So if the tiny model gets it right, you get a tiny speed boost. I can't say I fully understand why it's faster to check, either.

It needs a pretty large difference in size to result in a speedup. 0.5B vs 27B is the only pairing where I've seen one.
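
A rough back-of-the-envelope model may help with both points. It is a sketch under stated assumptions, not a measurement: it assumes one big-model pass over k drafted tokens costs about the same as one normal big-model decoding step, and that each drafted token is accepted independently with probability `alpha`. The names `alpha`, `cost_ratio`, and `expected_speedup` are made up for this example.

```python
def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over running only the big model.

    alpha:      assumed chance that each drafted token is accepted
    k:          number of tokens drafted per round
    cost_ratio: time of one small-model step / time of one big-model step
    """
    # Expected tokens per round: the accepted prefix plus the one token the
    # big model contributes itself, i.e. 1 + alpha + alpha^2 + ... + alpha^k.
    if alpha >= 1.0:
        tokens = float(k + 1)
    else:
        tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

    # Cost per round in units of one big-model step: k draft steps plus one
    # verification pass (assumed to cost one big-model step).
    cost = k * cost_ratio + 1.0
    return tokens / cost


if __name__ == "__main__":
    for ratio in (0.02, 0.2, 0.5):
        print(ratio, round(expected_speedup(alpha=0.8, k=4, cost_ratio=ratio), 2))
```

Under these assumptions, a draft model that is cheap relative to the big one gives a clear gain, while a draft model that costs half as much as the big one barely breaks even or loses, which matches the observation that the size gap needs to be large.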