Hacker News new | ask | show | jobs
by FranckDernoncou 37 days ago
Paper: https://arxiv.org/abs/2605.12825 ; Code+models: https://github.com/chiennv2000/orthrus ; Disclosure: co-author.

Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model.

Results:

- Up to 7.8x TPF, ~6x wall-clock on MATH-500.

- 16% of params trained, <1B tokens, 24h on 8xH200.

- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.

- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, zero TTFT penalty (no drafter to init/sync). KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).

- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.

Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.

7 comments

Amazing. Is it possible to do this with Qwen 3.6 27B? Will it work with quants (I assume so)?
From a quick and shallow view of the paper, it looks very feasible (with a little tinkering ) to be adapted to qwen3.6 27B. The process looks somewhat similar to training a LoRA, or in a way distilling your own model so that a mini model learns how to imitate it, and you glue them. I might bite the bullet and rent a gpu to do it for 3.6 27b, as this will solve a lot of my problems.
Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.
There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?
My plan is to validate it first using qwen3.5 0.8B if it even works (as it has the same architecture as qwen3.6 27b, just scaled down a bit) on my 3090. If it does, I'll make a git about the process if anyone wants to use my approach, while I try to convince my uni to lend me h100s for a day.
If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5(and 3.6) is Hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make is work with the GatedDeltaNet, and dry runs are promising, but only a full train will reveal if it works. More information in the repo and on the site under the "What is this all about?" button.

Note: i may restart it or try different configs at different points, if the site is down there is probably some sort of result/conclusion in the repo.

3.6 already supports multi token generation AFAIK
Yes, but not diffusion based, it's still doing token-at-a-time speculation.
I thought it can do multiple tokens at a time
Think of this as another way of achieving that. This theoretically has a higher ceiling of how much it can predict at a time. And more importantly is a lot more memory efficient during actual inference.
There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead depending on the quant. Thats not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in qwen's architecture.
So, it's D-Flash but at each transformer layer and share the KV cache of the original model? Very smart!
Kindof yeah - predictivity is a question though for larger layers - when trying to scale this up. But yeah, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
I'm sure I don't understand all the technical aspects, but I do understand that this is frickin' cool. Nice work.
Really cool work!

Does the training data budget scale with model size?

How would you compare the Gemma 4 draft model which is also integrated with the base kv cache?

On the limitation side:

Do you think this would scale to larger transformer models with more parameters per layer?

How would this work with MOE models or sparse models?

Do you plan on releasing the training code?
BTW the paper says

> Since only (Qdiff,Kdiff,Vdiff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.

But the code defines q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff, and it only matches 16% when you include the O term.