by dot_treo
30 days ago
Just getting it into a GGUF file would be fairly trivial. Actually using that GGUF file would need a bunch of additional work, though: one would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality. At the moment not even MTP (multi-token prediction) is merged into llama.cpp, so I wouldn't hold my breath for it.
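For anyone unfamiliar with the speculative decoding part, here is a toy sketch of the plain greedy draft-and-verify loop in Python. It is not llama.cpp code and not the paper's method; `speculative_decode`, `greedy_decode`, `toy_target` and `toy_draft` are all hypothetical names made up for illustration, and the "models" are just deterministic functions from a token prefix to the next token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals target-only greedy decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks the proposal position by position.
        #    A real engine would score all positions in one batched pass;
        #    this sketch loops for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always keep the target's token
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
        out.extend(accepted)
    return out


def toy_target(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model.
    return (sum(prefix) * 31 + 7) % 50


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small model: wrong on every third position.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [1, 2, 3], max_new=10)
    slow = greedy_decode(toy_target, [1, 2, 3], max_new=10)
    print(fast == slow)
```

The point of the sketch is that the draft model only affects how many tokens get accepted per round, never which tokens come out. That is why a new architecture would mostly need the verify step adapted rather than the sampling logic itself.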
|
Hope the paper gets lots of references and the technique gets a lot of use to save power and time.
There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called..?), Turbo Quant, and this one.
Interesting times.