Hacker News new | ask | show | jobs
by jncraton 791 days ago
The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.

[1] https://github.com/ggerganov/llama.cpp/pull/2926

1 comments

Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.