by DeathArrow 31 days ago
If someone can make this work with GGUF and quantized Qwen 3.6 or DeepSeek 4, it would greatly help with running local models.

Multi-token prediction is available now. I'm still getting it set up, but it sounds like it'll be 1.5x or 2x on the bigger models.
I've tried MTP, and it got me about 1.5x on average with a very spec-friendly benchmark.

I didn't run the full benchmark with the demo code; I just picked a single prompt from it. The prompt is about 1,300 tokens and the response is about 3,200 tokens.

Baseline: 44.8 t/s
With Orthrus: 164.6 t/s
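
For anyone who wants the ratio spelled out, here is a minimal sketch of the arithmetic on those two numbers. The helper names are made up for illustration, and the time estimate assumes the t/s figures cover decoding only, so it ignores prefill of the ~1,300-token prompt. It is a single-prompt data point, not a benchmark.

```python
# Rough arithmetic on the single-prompt numbers quoted above.
# Assumption: t/s is decode throughput only (prompt prefill not included).

BASELINE_TPS = 44.8      # tokens/second without Orthrus
ORTHRUS_TPS = 164.6      # tokens/second with Orthrus
RESPONSE_TOKENS = 3200   # approximate response length from the post


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Return how many times faster the accelerated run is."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return accelerated_tps / baseline_tps


def decode_seconds(num_tokens: int, tps: float) -> float:
    """Return the time to generate num_tokens at a steady tps rate."""
    if tps <= 0:
        raise ValueError("throughput must be positive")
    return num_tokens / tps


if __name__ == "__main__":
    ratio = speedup(BASELINE_TPS, ORTHRUS_TPS)
    base_s = decode_seconds(RESPONSE_TOKENS, BASELINE_TPS)
    fast_s = decode_seconds(RESPONSE_TOKENS, ORTHRUS_TPS)
    print(f"speedup: {ratio:.2f}x")
    print(f"baseline: {base_s:.1f} s, with Orthrus: {fast_s:.1f} s")
```

That comes out to roughly 3.7x, or about 71 seconds versus 19 seconds for the ~3,200-token response.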

Note: Don't use the `use_diffusion_mode=` config flag in their example to collect a baseline. Something about how it falls back to "normal" makes it grind to a crawl.