by jychang
276 days ago
Speculative decoding! It makes inference a LOT faster. Instead of generating tokens one at a time, the model also predicts the second token itself, and that prediction is then verified the way a speculative-decoding draft would be (instead of the draft coming from a separate small model like Qwen 0.6b). If the second token checks out, you got it MUCH faster than generating it normally. If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually it's correct, so overall inference is a lot faster.
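
To make the guess-then-check idea concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: `target_next` stands in for the big model, `draft_next` for the cheap guesser (a separate draft model or an extra prediction head), and `generate` assumes greedy decoding with a single guessed token per step. It counts "passes" of the big model, on the assumption that a real transformer can score the guessed position and the one after it in the same forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Toy stand-in for the big model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens: List[int]) -> int:
    """Toy stand-in for the cheap guesser: wrong only right after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def generate(prompt: List[int], n_new: int,
             target: NextToken, draft: NextToken) -> Tuple[List[int], int]:
    """Greedy decoding with one guessed token per step.

    Returns the new tokens and the number of big-model passes used.
    """
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        guess = draft(tokens)  # cheap guess for the next token

        # One simulated big-model pass. The toy calls `target` up to twice
        # below, but counts it once: the assumption is that a real model
        # scores every position of `tokens + [guess]` in a single pass.
        passes += 1
        actual = target(tokens)  # what the big model really wants here
        tokens.append(actual)    # always keep the big model's own token

        if guess == actual:
            # Guess verified: the same pass also gives the token after it.
            tokens.append(target(tokens))
        # Guess rejected: only one token was gained from this pass.

    return tokens[len(prompt):len(prompt) + n_new], passes


if __name__ == "__main__":
    out, passes = generate([0], 8, target_next, draft_next)
    print(out, passes)
```

With these toy functions the output matches what plain one-at-a-time decoding would give, but it takes fewer big-model passes than tokens produced, since every correct guess yields two tokens for one pass. The speedup in a real system depends on how often the guess is accepted and how cheap the guess is.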
|
I’m not an expert on LLMs, just a user.