by dot_treo 32 days ago
I've tried MTP, and that got me about a 1.5x speedup on average on a very spec-friendly benchmark.

I didn't run the full benchmark with the demo code; I just picked a single prompt from it. The prompt is about 1300 tokens and the response is about 3200 tokens.

Baseline: 44.8 t/s
With Orthrus: 164.6 t/s
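
For scale, here is a quick back-of-the-envelope sketch of what those two throughput numbers imply. It assumes the t/s figures are decode throughput and ignores the time spent processing the ~1300-token prompt, which I didn't measure separately. The function names are just illustrative and not from any library.

```python
def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_tps / baseline_tps


def decode_seconds(tokens: int, tps: float) -> float:
    """Approximate wall-clock time to generate `tokens` at `tps` tokens/second.

    Assumption: throughput is constant and prompt processing time is ignored.
    """
    return tokens / tps


if __name__ == "__main__":
    baseline_tps = 44.8      # measured baseline, tokens/second
    orthrus_tps = 164.6      # measured with Orthrus, tokens/second
    response_tokens = 3200   # approximate response length

    print(f"speedup: {speedup(baseline_tps, orthrus_tps):.2f}x")                     # 3.67x
    print(f"baseline: {decode_seconds(response_tokens, baseline_tps):.1f} s")        # 71.4 s
    print(f"with Orthrus: {decode_seconds(response_tokens, orthrus_tps):.1f} s")     # 19.4 s
```

So on this one prompt that is roughly 3.7x, versus the ~1.5x I saw with MTP. A single prompt is not the full benchmark, so treat it as a data point rather than an average.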

Note: Don't use the `use_diffusion_mode=` config flag in their example to collect a baseline. Something about the way it falls back to "normal" makes it slow to a crawl.