by ggamecrazy 395 days ago
They literally can! The exact speculative method is supported on vLLM using `speculative_model="[ngram]"`[1]

1: https://docs.vllm.ai/en/latest/features/spec_decode.html#spe...


Not quite. The paper uses its own N-gram rules with positive/negative/invariant weights as a rudimentary attention, and these rules are distilled from the model itself.

The vLLM option is something else. As I found out from the repo [0] linked in the Twitter thread in the documentation (which for some reason they didn't just link to directly), it seems to be a regular Markov chain over the context, if it even builds a stochastic matrix. See the algorithm below.

  Current prompt
  "Article: (CNN)French striker Bafetimbi Gomis, who has a history of [...]
  Summary: French stri"

  Prompt lookup algorithm
  1. Get last few tokens from prompt - "French stri"
  2. Search for "French stri" in prompt
  3. Match found - return next k tokens after match as candidate completion - "ker Bafetimbi Gomis, who has"

  Candidate tokens
  "ker Bafetimbi Gomis, who has"
[0] https://github.com/apoorvumang/prompt-lookup-decoding
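
For what it's worth, here is roughly how I read those three steps, as a minimal Python sketch. This is not the repo's code or vLLM's implementation: `prompt_lookup`, `max_ngram`, and `k` are names I made up for illustration, and I'm using single characters as stand-in tokens where a real implementation would work on token IDs.

```python
from typing import List, Sequence


def prompt_lookup(tokens: Sequence[str], max_ngram: int = 3, k: int = 5) -> List[str]:
    """Return up to k tokens that followed an earlier copy of the trailing n-gram.

    Hypothetical helper for illustration only. No probabilities are
    estimated anywhere; it is a plain substring search over the context.
    """
    n_tokens = len(tokens)
    # Try the longest n-gram first, then fall back to shorter ones.
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        tail = list(tokens[-n:])
        # Scan earlier start positions, most recent first, skipping the tail itself.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == tail:
                # The k tokens right after the match become the draft.
                return list(tokens[start + n:start + n + k])
    return []  # no match: nothing to speculate with


# The "[...]" is kept literally, exactly as in the example above.
prompt = (
    "Article: (CNN)French striker Bafetimbi Gomis, who has a history of [...]\n"
    "Summary: French stri"
)
candidate = "".join(prompt_lookup(list(prompt), max_ngram=11, k=28))
print(repr(candidate))
```

The draft is only a guess. In speculative decoding the target model still verifies the candidate tokens and keeps the prefix it agrees with, so a bad match costs some wasted compute rather than a wrong output.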