by wishawa
255 days ago
I didn't know this! I'd always thought speculative decoding was "if p(draft_token) > threshold, use it". You made me go read how it actually works, and it's pretty neat! That said, I still think some providers are cheating. Please correct me if the test below is flawed. I generated text at temperature = 0 and at temperature = 2. At high temperature the distributions become flatter, so the difference between the real (target) and draft effective distributions (the D_LK used in Theorem 3.5 of arXiv 2211.17192) becomes smaller. At T=2 the model outputs complete gibberish, so the effective distribution must be pretty flat. That should mean fewer rejections and therefore much faster speculative decoding. Yet I see no increase in throughput at all...
However, if you raise the temperature but still sample with top-k and a small k, I'm not sure it will translate into any noticeable difference, since the top-k truncation keeps the distributions you actually sample from very much non-uniform.
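
To make the two effects concrete, here is a rough sketch in plain Python. It assumes, as I read the paper, that the per-token acceptance probability is sum(min(p, q)) = 1 - D_LK(p, q), where p and q are the distributions the target and draft models actually sample from. The logits and the function names are made up for illustration; this is not taken from any provider's implementation.

```python
import math

def sampling_dist(logits, temperature, top_k=None):
    """Distribution actually sampled from after optional top-k and temperature."""
    if top_k is not None:
        # Keep the top_k largest logits (ties at the cutoff are all kept).
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= cutoff else -math.inf for x in logits]
    if temperature == 0:
        # Greedy decoding: all mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((x - m) / temperature) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def acceptance_rate(p, q):
    """Probability that one drafted token is accepted: sum of min(p_i, q_i)."""
    return sum(min(a, b) for a, b in zip(p, q))

# Hypothetical logits where the target and draft disagree on the top token.
TARGET = [4.0, 2.0, 1.0, 0.5, 0.0]
DRAFT = [2.0, 4.0, 0.5, 1.0, 0.0]

def rate(temperature, top_k=None):
    return acceptance_rate(
        sampling_dist(TARGET, temperature, top_k),
        sampling_dist(DRAFT, temperature, top_k),
    )

for t, k in [(0, None), (1.0, None), (2.0, None), (2.0, 2), (2.0, 1)]:
    print(f"T={t} top_k={k} acceptance={rate(t, k):.3f}")
```

With these toy numbers, acceptance goes up as T increases, but a small top-k pulls it back down, and k=1 behaves like greedy decoding no matter what T is. That only shows the direction of the effect for this one example; whether it explains the throughput you measured depends on the provider's actual sampling settings.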