by luyu_wu 633 days ago
The section on speculative decoding is interesting. "This approach allows each forward pass to generate multiple tokens without compromising performance, thereby significantly reducing memory access consumption, and enabling several orders of magnitude speed improvements."

Does anyone know if the "several orders of magnitude speed improvement" is accurate? I'm doubtful.

Very interesting though! I'll be playing around with this on the weekend!

1 comment

Orders of magnitude seems a bit ambitious. The implementation from the DeepMind paper achieved a 2-2.5x speedup (https://arxiv.org/pdf/2302.01318), and most of the tests I've seen [1][2] have been similar, but there are different variations (Medusa, Ouroboros, etc.) that can do better or be combined. Recently Together.ai published SpecExec, an SD variant which did claim 10-18x speedups: https://www.together.ai/blog/specexec

[1] https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/specula...

[2] https://arxiv.org/pdf/2402.01528v3
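For anyone unfamiliar with the mechanism behind these numbers, here is a minimal sketch of greedy draft-then-verify decoding. Everything in it is a toy assumption for illustration: the `target` and `draft` functions, the names `speculative_step` and `generate`, and the draft length `gamma` are all hypothetical, and this is not the DeepMind or SpecExec implementation. Real systems score all drafted positions in one batched target forward pass and typically use a probabilistic accept/reject rule rather than exact matching.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], gamma: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. The cheap draft model proposes gamma tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(gamma):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal. A real system does this in a single
    #    batched forward pass; the per-position loop is only for readability.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All drafts matched, so the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prefix: List[int],
             n_tokens: int, gamma: int = 3) -> Tuple[List[int], int]:
    """Generate n_tokens after prefix; also return how many rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(target, draft, out, gamma))
        rounds += 1
    return out[: len(prefix) + n_tokens], rounds


# Toy models (assumptions): the target counts upward mod 10; the draft agrees
# except right after a 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def plain_greedy(target: Model, prefix: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [0], n_tokens=12)
    print(tokens, "rounds:", rounds)
```

The output is identical to plain greedy decoding with the target; the only thing that changes is how many target rounds it takes. Each round commits at most `gamma + 1` tokens, so the per-round gain is capped by the draft length.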

BTW, I got a chance to read through the model card and there's a section that shows their SD gains: https://huggingface.co/amd/AMD-Llama-135m#speculative-decodi...

- 1.75x-2.80x on MI250

- 2.83x-2.98x on NPU

- 3.57x-3.88x on CPU

Note that they tested AMD-Llama-135m-code as the draft model for CodeLlama-7b, and both score similarly poorly on HumanEval Pass@1 (~30%). If a similarly trained 135m model were used as the SD draft for, say, Qwen2.5-Coder (88.4% on HumanEval), the perf gains would probably be much worse, since the draft's guesses would presumably be accepted less often.
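To put rough numbers on how much the draft/target match matters, here is a back-of-the-envelope sketch using a commonly used simplified model of speculative decoding (the one from Leviathan et al.'s analysis, not anything in the AMD model card). It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs a fraction `c` of one target step. All three parameters and the function names are assumptions for illustration, not measurements.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target pass under i.i.d. acceptance.

    alpha: per-token acceptance probability (assumed independent per token)
    gamma: number of tokens drafted per round
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + alpha + ... + alpha**gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    c: cost of one draft step relative to one target step (assumed constant)
    """
    # Each round costs gamma draft steps plus one target pass.
    return expected_tokens_per_round(alpha, gamma) / (gamma * c + 1.0)


if __name__ == "__main__":
    for alpha in (0.9, 0.8, 0.6, 0.4):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, gamma=4, c=0.05):.2f}x")
```

Under this model the speedup can never exceed `gamma + 1`, so "several orders of magnitude" would need an implausibly long draft with near-perfect acceptance, and the gain shrinks quickly as `alpha` falls, which is what you would expect when a weak draft is paired with a much stronger target.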