|
|
|
|
|
by modeless
807 days ago
|
|
It's a start but it's disappointing that half the layers still have to process every token. It seems like we ought to be able to get to 90% or even 99% savings when these models currently allocate the same compute for outputting "the" as they do for outputting the first digit of the answer of a complicated math problem. |
|
https://huggingface.co/blog/whisper-speculative-decoding