

by jncraton 791 days ago

The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.
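To make the draft-then-verify loop concrete, here is a rough Python sketch of n-gram drafting with greedy verification. It is not the paper's ANPD implementation: `NGramDrafter`, `generate`, and the `next_token` callback are hypothetical names, the "adaptive" part is reduced to updating n-gram counts as tokens arrive, and verification is simulated token by token where a real system would score the whole draft in one batched forward pass.

```python
from collections import Counter, defaultdict


class NGramDrafter:
    """Hypothetical n-gram table built from the tokens seen so far."""

    def __init__(self, n=2):
        self.n = n
        self.table = defaultdict(Counter)  # context tuple -> next-token counts

    def observe(self, tokens):
        # Record the newest (context -> next token) pair at the end of tokens.
        if len(tokens) >= self.n:
            ctx = tuple(tokens[-self.n:-1])
            self.table[ctx][tokens[-1]] += 1

    def draft(self, tokens, k):
        # Propose up to k tokens by following the most frequent continuation.
        out, ctx = [], list(tokens)
        for _ in range(k):
            key = tuple(ctx[-(self.n - 1):]) if self.n > 1 else ()
            if key not in self.table:
                break
            nxt = self.table[key].most_common(1)[0][0]
            out.append(nxt)
            ctx.append(nxt)
        return out


def generate(next_token, prompt, max_new, n=2, k=4):
    """Greedy decoding with n-gram drafts; returns (new_tokens, passes).

    next_token(tokens) stands in for the LLM's greedy prediction
    (an assumption for this sketch).
    """
    tokens = list(prompt)
    drafter = NGramDrafter(n)
    for i in range(n, len(tokens) + 1):  # prime the table with the prompt
        drafter.observe(tokens[:i])
    target = len(prompt) + max_new
    passes = 0  # simulated verification passes
    while len(tokens) < target:
        draft = drafter.draft(tokens, k)
        passes += 1
        accepted = []
        for d in draft:
            t = next_token(tokens + accepted)
            if t != d:
                accepted.append(t)  # keep the model's token, drop the rest
                break
            accepted.append(d)
        else:
            # Whole draft accepted (or empty): take one extra model token.
            accepted.append(next_token(tokens + accepted))
        for t in accepted:
            if len(tokens) >= target:
                break
            tokens.append(t)
            drafter.observe(tokens)
    return tokens[len(prompt):], passes


# Toy stand-in for the model: a repeating 0, 1, 2 cycle.
toy_model = lambda toks: (toks[-1] + 1) % 3
new_tokens, passes = generate(toy_model, [0, 1, 2, 0], max_new=9)
print(new_tokens, passes)
```

Every emitted token is either confirmed or supplied by `next_token`, so the output matches plain greedy decoding; the drafts only reduce the number of verification passes when the text is repetitive enough for the n-gram table to guess right.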

[1] https://github.com/ggerganov/llama.cpp/pull/2926

1 comment

MacsHeadroom 790 days ago

Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.


eshoyuan 790 days ago

https://github.com/ggerganov/llama.cpp/tree/master/examples/...
