I've tried MTP, and that got me about 1.5x on average on a very spec-friendly benchmark.
I didn't run the full benchmark with the demo code, just picked a single prompt from it. The prompt is about 1300 tokens and the response is about 3200 tokens.
Baseline: 44.8 t/s
With Orthrus: 164.6 t/s
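
For a sense of scale, here is a quick back-of-the-envelope calculation from those two numbers. It assumes the t/s figures are pure decode throughput and ignores prompt processing time, which I didn't measure separately, so treat the wall-clock estimates as rough.

```python
# Figures reported above; everything else is derived from them.
baseline_tps = 44.8      # tokens/s, baseline
orthrus_tps = 164.6      # tokens/s, with Orthrus
response_tokens = 3200   # approximate response length

# Assumption: t/s is decode-only throughput, prefill time is ignored.
speedup = orthrus_tps / baseline_tps
baseline_secs = response_tokens / baseline_tps
orthrus_secs = response_tokens / orthrus_tps

print(f"speedup: {speedup:.2f}x")                      # about 3.67x
print(f"baseline decode time: {baseline_secs:.1f} s")  # about 71.4 s
print(f"Orthrus decode time: {orthrus_secs:.1f} s")    # about 19.4 s
```

So on this one prompt it works out to roughly 3.7x, compared with the ~1.5x I saw from MTP. Keep in mind that's a single prompt versus a benchmark average.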
Note: Don't use the `use_diffusion_mode=` config flag from their example to collect a baseline. Something about how it falls back to "normal" mode makes it grind to a crawl.