Orders of magnitude seems a bit ambitious. The implementation from the DeepMind paper achieved a 2-2.5X speedup (https://arxiv.org/pdf/2302.01318), and most of the tests I've seen [1][2] have been similar, but there are variations (Medusa, Ouroboros, etc.) that can do better or be combined. Recently Together.ai published SpecExec, an SD variant that claims 10-18X speedups: https://www.together.ai/blog/specexec
- 1.75x-2.80x on MI250
- 2.83x-2.98x on NPU
- 3.57x-3.88x on CPU
Note that they were testing with AMD-Llama-135m-code as the draft model for CodeLlama-7b, both of which do similarly badly on HumanEval Pass@1 (~30%). If they had used a similarly trained 135M model as the SD draft for, say, Qwen2.5-Coder (88.4% on HumanEval), the draft would likely agree with the much stronger target less often, so the perf gains would probably be much worse.
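
To make the draft-quality point concrete, here is a rough sketch using the simplified speedup model from the Leviathan et al. speculative decoding paper (not the DeepMind one linked above). It assumes each drafted token is accepted independently with probability `alpha`, that the draft proposes `gamma` tokens per step, and that one draft forward pass costs a fraction `c` of a target pass. The function name and every number plugged in below are made up for illustration; none of them are measurements from any of the setups mentioned here.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealized speculative decoding speedup over plain autoregressive decoding.

    alpha: assumed per-token acceptance probability (0 <= alpha < 1)
    gamma: number of tokens the draft model proposes per step
    c:     cost of one draft pass relative to one target pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    # Expected tokens produced per verification step (geometric series).
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one step in units of target passes: gamma draft passes + 1 target pass.
    cost_per_step = gamma * c + 1.0
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Illustrative values only: a tiny draft model (c = 0.05), 5 drafted tokens per step.
    for alpha in (0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f} -> ~{expected_speedup(alpha, gamma=5, c=0.05):.2f}x")
```

Under this model the speedup falls off quickly as `alpha` drops, which is the mechanism behind the worry about pairing a weak 135M draft with a much stronger target. Real acceptance rates are not independent per token, so treat this as a back-of-the-envelope estimate only.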