by vessenes
67 days ago
Why is it that speculative decoding lowers quality? My understanding is that you use a small/distilled fast model to predict the next tokens, and when a prediction doesn't match the large model's choice, you generate from the large model instead. Checking against the large model is quick. This should maintain exactly the quality of the original model, no?
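To make the question concrete, here is a minimal sketch of the greedy case, under loud assumptions: `target_model` and `draft_model` are hypothetical toy stand-ins (plain deterministic functions, not any real library's API), and `k` is an assumed draft length. It only illustrates why, when every emitted token is checked against the large model, the output should match plain greedy decoding token for token.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: a fixed rule standing in for argmax decoding.
    return (sum(prefix) * 7 + len(prefix)) % 10

def draft_model(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when the
    # prefix length is a multiple of 3, where it guesses wrong on purpose.
    guess = target_model(prefix)
    return (guess + 1) % 10 if len(prefix) % 3 == 0 else guess

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. A real system would do
        #    this in one batched forward pass; the loop is for clarity only.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target's check also yields
            # one extra token for free.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]

prompt = [1, 2, 3]
spec = speculative_decode(target_model, draft_model, prompt, n_new=20)
plain = greedy_decode(target_model, prompt, n_new=20)
print(spec == plain)
```

In this sketch every token that ends up in the output is one the target itself would have picked, so the two sequences are identical. Sampled (non-greedy) decoding needs a probabilistic accept/reject rule to preserve the target's distribution, which this sketch does not cover.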