by mmoskal
781 days ago
Yes.
It's speculative decoding, but instead of generating just a few sequential tokens with the draft model, they generate a whole tree of candidate continuations, of some sort of optimal shape, covering hundreds of possible sequences. It ends up being somewhat faster than regular speculative decoding in the normal (GPU-only) setting. If you are doing CPU offloading, it's massively faster.
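To make the tree idea concrete, here is a rough toy sketch in Python. It is not the method from the linked work: the names `Node`, `build_draft_tree`, `verify_tree`, `draft_topk` and `target_next` are all made up for illustration, the "models" are trivial arithmetic functions, the tree is a plain fixed-branching tree rather than an optimized shape, and verification is simple greedy matching. A real system would score every node of the tree in one batched forward pass of the big model; here the target is called once per step to keep the logic readable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One drafted token; children are alternative continuations."""
    token: int
    children: List["Node"] = field(default_factory=list)


def build_draft_tree(prefix, draft_topk, depth, branching):
    """Expand the draft model's top `branching` guesses at every level."""
    def expand(ctx, remaining):
        if remaining == 0:
            return []
        return [
            Node(tok, expand(ctx + [tok], remaining - 1))
            for tok in draft_topk(ctx, branching)
        ]
    return expand(list(prefix), depth)


def verify_tree(prefix, roots, target_next):
    """Greedy verification: follow whichever branch the target agrees with.

    Returns the tokens to emit: every matched draft token plus the one
    token the target produces where the tree runs out or disagrees.
    """
    accepted = []
    level = roots
    while True:
        want = target_next(list(prefix) + accepted)
        accepted.append(want)  # the target's own token is always kept
        match = next((n for n in level if n.token == want), None)
        if match is None:
            return accepted
        level = match.children


# Toy stand-ins (assumptions, not real models): the target counts upward
# mod 10, and the draft's FIRST guess is always wrong, its second is right.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_topk(ctx, k):
    last = ctx[-1]
    return [(last + 2) % 10, (last + 1) % 10][:k]


if __name__ == "__main__":
    chain = build_draft_tree([0], draft_topk, depth=3, branching=1)
    tree = build_draft_tree([0], draft_topk, depth=3, branching=2)
    print("chain draft:", verify_tree([0], chain, target_next))
    print("tree draft: ", verify_tree([0], tree, target_next))
```

With a single chain (branching 1) the first wrong draft token ends the round, so only the target's own token is emitted. With branching 2 the correct continuation is somewhere in the tree at every level, so the whole depth is accepted in one round. That is the intuition for why a wide tree pays off most when each call to the big model is very expensive, as with CPU offloading.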