by jncraton
791 days ago
The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.

[1] https://github.com/ggerganov/llama.cpp/pull/2926
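To make the draft-then-verify idea concrete, here is a rough sketch of n-gram drafting with greedy verification. It is not the paper's ANPD algorithm, just a simplified fixed-order n-gram table built from the tokens seen so far. `NGramDrafter`, `generate`, and `target_next` are hypothetical names, and `target_next` stands in for the LLM's greedy next-token choice.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Sequence, Tuple


class NGramDrafter:
    """Hypothetical drafter: guesses the next token from the previous n-1 tokens."""

    def __init__(self, n: int = 2) -> None:
        assert n >= 2, "need at least one token of context"
        self.n = n
        self.counts = defaultdict(Counter)

    def observe(self, tokens: Sequence[str]) -> None:
        # Count every n-gram contained in `tokens`.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def draft(self, tokens: Sequence[str], k: int) -> List[str]:
        # Chain up to k most-frequent continuations; stop at an unseen context.
        ctx, out = list(tokens), []
        for _ in range(k):
            key = tuple(ctx[-(self.n - 1):])
            if key not in self.counts:
                break
            nxt = self.counts[key].most_common(1)[0][0]
            out.append(nxt)
            ctx.append(nxt)
        return out


def generate(
    target_next: Callable[[List[str]], str],
    prompt: Sequence[str],
    max_new: int,
    n: int = 2,
    k: int = 4,
) -> Tuple[List[str], int]:
    """Returns (tokens, number of verification rounds)."""
    drafter = NGramDrafter(n)
    tokens = list(prompt)
    drafter.observe(tokens)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        draft = drafter.draft(tokens, k)
        # Assumption: a real LLM scores all draft positions in one batched
        # forward pass. Here that pass is simulated position by position
        # and counted as a single round.
        rounds += 1
        accepted: List[str] = []
        for tok in draft:
            if tok != target_next(tokens + accepted):
                break  # first mismatch: discard the rest of the draft
            accepted.append(tok)
        # Each round also yields one token chosen by the target itself.
        accepted.append(target_next(tokens + accepted))
        remaining = max_new - (len(tokens) - len(prompt))
        accepted = accepted[:remaining]
        start = len(tokens)
        tokens.extend(accepted)
        # Only count n-grams that end in a newly added token.
        drafter.observe(tokens[max(0, start - (n - 1)):])
    return tokens, rounds


# Toy stand-in for an LLM: deterministically repeats a fixed pattern.
PATTERN = ["a", "b", "c", "d"]


def toy_target(tokens: List[str]) -> str:
    return PATTERN[len(tokens) % len(PATTERN)]


out, rounds = generate(toy_target, PATTERN, max_new=12)
print(out, rounds)
```

Because every accepted token is checked against the target's own greedy choice, the output matches plain greedy decoding. The only thing that changes is how many verification rounds are needed, which is also why stacking this on top of a draft-model setup gains little: both are competing to fill the same drafting step.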