HN Mirror



	by dot_treo 77 days ago
	Just getting it into a GGUF file would be fairly trivial. But actually using that GGUF file would need a bunch of additional things: one would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality. At the moment, not even MTP is merged into llama.cpp, so I wouldn't hold my breath for it.
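For anyone wondering what the speculative decoding part involves, here is a rough sketch of a generic greedy draft-and-verify loop. It is a toy under stated assumptions, not llama.cpp's implementation and not the method from the paper: `target`, `draft`, `greedy`, and `speculative_greedy` are hypothetical names, and both "models" are plain functions that map a token list to the next token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical "model": context -> next token


def greedy(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain one-token-at-a-time greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify: the output matches greedy(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed token. A real engine would do this
        #    in one batched forward pass; the toy calls the target per position.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target contributes one more.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))

        out.extend(accepted)
    return out


# Toy stand-ins (assumptions for illustration only).
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[Token]) -> Token:
    # Agrees with the target except when the last token is 2, 5, or 8.
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10
```

The point of the sketch is that the draft only affects speed, never the result: whatever it proposes, the accepted tokens are the ones the target would have picked anyway. A new spec decoding type would presumably have to plug into the propose and verify steps in some comparable way.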

2 comments

kreelman 76 days ago

I thought that might be the case. I naively wondered. I'll see if I can understand the paper :-)

Hope the paper gets lots of references and the technique gets a lot of use to save power and time.

There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called..?), Turbo Quant, and this one.

Interesting times.


regularfry 76 days ago

MTP merged today, a couple of hours after your post by the looks of things.


dot_treo 76 days ago

By the looks of it, it will take a couple more follow-up PRs to clean things up a bit and get the most performance out of MTP. I hope that by that point it will be easier to add more spec decoding types.

In the meantime I've benchmarked Orthrus some more and got some quite promising results. So I'd be glad if my prediction that it may take some time until it lands in llama.cpp turns out to be wrong.
