by manmal 418 days ago
A less known feature of LM Studio I really like is speculative decoding: https://lmstudio.ai/blog/lmstudio-v0.3.10

Basically you let a very small model speculate on the next few tokens, and the large model then blesses/rejects those predictions. Depending on how well the small model performs, you get massive speedups that way.

The small model has to be as close to the big model as possible. I tried this with models from different vendors and it slowed generation down by about 3x. So you need to pair a small Qwen 2.5 with a big Qwen 2.5, and so on.


How exactly does this give a speedup? If you have to wait for the large model to confirm the small model's predictions, wouldn't it always be slower than just running the large model?
As far as I understand, it works like this:

1. The small model generates k tokens (probably k >= 4 or even higher; there's a tradeoff here that depends on the model sizes).

2. The big model processes the logits (probabilities) for all k tokens in parallel.

3. Ideally, all tokens pass the probability threshold. That might be the case for standard phrases that the model likes to use, like "Alright, the user wants me to". If not all tokens pass the threshold, the first unsuitable token and everything after it are discarded.

4. Return to 1., maybe with an adjusted k.
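
The steps above can be sketched as a toy loop. This is only an illustration under stated assumptions, not how LM Studio or any particular engine implements it: the "models" are hypothetical plain functions that map a token prefix to a single greedy next token, and the "probability threshold" is replaced by the simplest possible rule (accept a drafted token only if the big model would have picked the same one). The names `speculative_decode`, `toy_target`, and `toy_draft` are made up for the example.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch: draft k tokens, verify, keep the agreed prefix."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # Step 1: the small model proposes k tokens one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Step 2: the big model scores every drafted position.
        # A real engine does this in one batched forward pass; this toy just loops.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # Step 3: keep the longest prefix on which both models agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        tokens.extend(proposal[:n_ok])

        # The big model's own token at the first mismatch (or right after k
        # accepted tokens) comes out of the same pass, so every round adds
        # at least one token.
        tokens.append(verified[n_ok])
        # Step 4: loop again (a real engine might adjust k here).
    return tokens[:goal]


def toy_target(prefix: List[int]) -> int:
    # Toy "big model": counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    # Toy "small model": agrees with the target except right after a 2.
    return 0 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


print(speculative_decode(toy_target, toy_draft, [0], n_new=12))
```

With this acceptance rule the output is identical to what the big model would have produced on its own; the small model only changes how many big-model passes are needed to get there.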

Apparently, for the bigger model, checking a token is faster than generating a fresh one. So if the tiny model gets it right, you get a small speedup. Can't say I fully understand why it's faster to check, either.

It needs a pretty large difference in size to result in a speedup. 0.5B vs 27B is the only pairing where I've seen one.
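
A rough back-of-the-envelope model shows why both the size gap and the match between the two models matter. It is an idealized sketch, not a measurement: it assumes each drafted token is accepted independently with probability `alpha`, that one small-model step costs a fraction `c` of one big-model step, and that verifying k drafted tokens costs about one big-model step. The function name and parameters are hypothetical.

```python
def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup of speculative decoding over plain decoding.

    alpha: assumed per-token acceptance probability (0..1)
    k:     number of drafted tokens per round
    c:     assumed cost of one draft step relative to one target step
    """
    # Expected tokens produced per round: 1 + alpha + alpha^2 + ... + alpha^k
    # (accepted draft tokens plus the one token the big model always contributes).
    if alpha == 1.0:
        tokens_per_round = float(k + 1)
    else:
        tokens_per_round = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Cost per round in units of big-model steps: k draft steps plus one verification pass.
    cost_per_round = k * c + 1.0
    return tokens_per_round / cost_per_round


# Well-matched, much smaller draft model: clear win under these assumptions.
print(round(estimated_speedup(alpha=0.8, k=4, c=0.05), 2))
# Poorly matched, relatively expensive draft model: slower than no drafting at all.
print(round(estimated_speedup(alpha=0.2, k=4, c=0.3), 2))
```

Under these assumptions, a low acceptance rate (for example, models from different vendors) or a draft model that is not much cheaper than the big one pushes the ratio below 1, which matches the slowdowns reported above.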