
by wishawa 255 days ago

I didn't know this! I've always thought speculative decoding was "if p(draft_token) > threshold, use it". You made me go read how it actually works and it's pretty neat!

That said, I still think some providers are cheating. Please correct me if the test below is flawed.

I generated texts at temperature = 0 vs temperature = 2. At high temperature, the distributions effectively become flatter, meaning the difference between real and draft effective distributions (the D_LK used in theorem 3.5 of 2211.17192) becomes smaller. When T=2, the model speaks complete gibberish, so the effective distribution must be pretty flat. This should mean fewer rejections --> a lot faster speculative decoding. Yet, I see no increase in throughput at all...
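A minimal sketch of the reasoning above, not a model of any provider's implementation. It assumes the per-token acceptance probability is the sum over tokens of min(p, q), i.e. 1 - D_LK(p, q), and that the same temperature is applied to both the target and the draft logits. The logit values and the helper names `softmax` and `acceptance_rate` are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def acceptance_rate(p, q):
    """Probability that a token drawn from draft q is accepted under target p.

    Equals sum_x min(p(x), q(x)), which is 1 - D_LK(p, q).
    """
    return sum(min(a, b) for a, b in zip(p, q))

# Hypothetical logits for a 5-token vocabulary; the two models
# disagree about which token is most likely.
target_logits = [4.0, 2.0, 1.0, 0.5, 0.0]
draft_logits = [2.5, 3.0, 0.5, 1.0, 0.0]

for t in (0.5, 1.0, 2.0, 10.0):
    p = softmax(target_logits, t)
    q = softmax(draft_logits, t)
    print(f"T={t}: acceptance ~ {acceptance_rate(p, q):.3f}")
```

With these toy logits the acceptance rate rises as T grows, because both distributions move toward uniform. Whether it rises monotonically for real model logits is not guaranteed; only the T -> infinity limit (acceptance -> 1) is.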


sailingparrot 255 days ago

Not sure exactly what setup you are running. In theory, yes: a higher temperature for both models means a higher chance of overlap and thus fewer rejections -> faster sampling (but worse quality overall).

However, if you raise the temperature but are still sampling with top-k where k is small, I'm not sure it will translate into any noticeable difference, since the truncation keeps your actual distributions very much non-uniform.


wishawa 255 days ago

This is with Together's API via OpenRouter, running DeepSeek V3 0324 and Kimi K2 0905.

I didn't set a top-k. So it seems like Together must be doing something weird in their speculative decoding implementation.


sailingparrot 255 days ago

Oh, in that case there is definitely a top-k or top-p behind the scenes; it might just not be exposed to the user as a param they can change through their API. I haven’t heard of anyone running an LLM in prod with actual pure sampling.


wishawa 254 days ago

I see. That's slightly unfortunate. In principle, increasing temperature flattens out the distribution, but the ordering of the tokens' probabilities remains the same, so setting a top-k shouldn't break my test. Can't say the same for top-p though. And all of this is probably too deep into the provider's implementation details for me to make assumptions about.
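A small sketch of the top-k vs top-p point, under the assumption that temperature is applied before truncation (providers may order these steps differently). The logits and the helper names `top_k_set` and `top_p_set` are hypothetical, for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_set(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_set(probs, top_p):
    """Smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # made-up values

for t in (1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: top-k(2)={sorted(top_k_set(probs, 2))} "
          f"top-p(0.9)={sorted(top_p_set(probs, 0.9))}")
```

Because dividing logits by a positive temperature preserves their order, the top-k set is the same at every T. The top-p set grows as the distribution flattens, since more tokens are needed to reach the same cumulative mass.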
