

by FranckDernoncou 37 days ago

Paper: https://arxiv.org/abs/2605.12825 ; Code+models: https://github.com/chiennv2000/orthrus ; Disclosure: co-author.

Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model.
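To make the "accepts the longest matching prefix" step concrete, here is a minimal sketch of greedy draft verification. It is not taken from the Orthrus repo: the function name and the list-of-token-ids interface are hypothetical, and emitting one extra verifier token per cycle is the usual speculative-decoding convention, assumed here rather than confirmed by the post.

```python
from typing import List, Sequence


def accept_longest_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Return the tokens to commit for one draft/verify cycle (greedy decoding).

    draft:    K token ids proposed in parallel (hypothetical drafter output).
    verified: K + 1 token ids from the verifier, where verified[i] is its greedy
              choice for position i given the context plus draft[:i].
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verifier must return one more token than the draft")

    # Count how many leading draft tokens agree with the verifier.
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1

    # Commit the agreed prefix, then the verifier's own token at the first
    # disagreement (or its extra token if the whole draft was accepted).
    return list(draft[:n]) + [verified[n]]


if __name__ == "__main__":
    print(accept_longest_prefix([5, 6, 7, 8], [5, 6, 9, 1, 2]))
```

Every committed token is the verifier's greedy choice given the tokens committed before it, which is why this scheme reproduces plain greedy decoding of the verifier token for token.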

Results:

- Up to 7.8x TPF, ~6x wall-clock on MATH-500.

- 16% of params trained, <1B tokens, 24h on 8xH200.

- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.

- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, zero TTFT penalty (no drafter to init/sync). KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).

- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.

Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.

7 comments

ilaksh 37 days ago

Amazing. Is it possible to do this with Qwen 3.6 27B? Will it work with quants (I assume so)?

sleepyeldrazi 37 days ago

From a quick, shallow read of the paper, it looks very feasible (with a little tinkering) to adapt this to Qwen3.6 27B. The process looks somewhat similar to training a LoRA, or in a way to distilling your own model so that a mini model learns to imitate it, and then gluing the two together. I might bite the bullet and rent a GPU to do it for 3.6 27B, as this would solve a lot of my problems.

sleepyeldrazi 37 days ago

Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.

Boranbruh 37 days ago

There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?

sleepyeldrazi 37 days ago

My plan is to first check whether it works at all using Qwen3.5 0.8B on my 3090 (it has the same architecture as Qwen3.6 27B, just scaled down a bit). If it does, I'll put up a git repo documenting the process in case anyone wants to use my approach, while I try to convince my uni to lend me H100s for a day.

sleepyeldrazi 36 days ago

If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5 (and 3.6) is hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with GatedDeltaNet, and dry runs are promising, but only a full training run will reveal whether it works. More information is in the repo and on the site under the "What is this all about?" button.

Note: I may restart it or try different configs at various points. If the site is down, there is probably some sort of result or conclusion in the repo.

0-_-0 36 days ago

3.6 already supports multi-token generation AFAIK

jbellis 36 days ago

Yes, but not diffusion based, it's still doing token-at-a-time speculation.

0-_-0 36 days ago

I thought it can do multiple tokens at a time

sleepyeldrazi 36 days ago

Think of this as another way of achieving that. It theoretically has a higher ceiling on how many tokens it can predict at a time and, more importantly, it is a lot more memory-efficient during actual inference.

regularfry 36 days ago

There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in Qwen's architecture.

littlestymaar 37 days ago

So, it's D-Flash, but at each transformer layer and sharing the KV cache of the original model? Very smart!

foobar10000 36 days ago

Kind of, yeah. Predictivity is an open question for larger layers when trying to scale this up, though. But yeah, this is a "95% predictor in latent space is a 7x improvement in speed if done right" approach.
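One back-of-envelope way to read the "95% predictor" remark is sketched below. It assumes, unrealistically, that each drafted token is accepted independently with the same probability `p`, and that every cycle also yields one token from the verifier; neither assumption comes from the paper, and the function name is made up for illustration.

```python
def expected_tokens_per_cycle(p: float, k: int) -> float:
    """Expected tokens committed per draft/verify cycle in a toy i.i.d. model.

    p: assumed probability that any single drafted token is accepted.
    k: number of tokens drafted in parallel per cycle.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")

    # Draft token i is committed only if all i leading draft tokens were
    # accepted (probability p**i); the i = 0 term is the verifier's own token,
    # which is always committed.
    return sum(p ** i for i in range(k + 1))


if __name__ == "__main__":
    for p in (0.80, 0.90, 0.95):
        print(p, round(expected_tokens_per_cycle(p, 32), 2))
```

In this toy model the yield per cycle is capped by roughly 1 / (1 - p), so small gains in per-token agreement matter a lot at the high end. Real acceptance is position-dependent, and wall-clock speedup also depends on how much each cycle costs relative to a plain decoding step.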

deflator 34 days ago

I'm sure I don't understand all the technical aspects, but I do understand that this is frickin' cool. Nice work.

jbellis 36 days ago

Really cool work!

Does the training data budget scale with model size?

How would you compare it to the Gemma 4 draft model, which is also integrated with the base KV cache?

gkapur 36 days ago

On the limitation side:

Do you think this would scale to larger transformer models with more parameters per layer?

How would this work with MOE models or sparse models?

dot_treo 37 days ago

Do you plan on releasing the training code?

jbellis 36 days ago

BTW the paper says

> Since only (Q_diff, K_diff, V_diff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.

But the code defines q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff, and it only matches 16% when you include the O term.
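The arithmetic behind that observation can be checked with a short sketch. The shape values below are assumptions about Qwen3-8B (they should be verified against the model's `config.json`), the base parameter count is rounded, "full model" is assumed to mean base plus added weights, and biases and norm weights are ignored.

```python
def attn_proj_params(hidden: int, n_heads: int, n_kv_heads: int,
                     head_dim: int, include_o: bool) -> int:
    """Weight count of one layer's attention projections (no biases)."""
    q = hidden * n_heads * head_dim
    k = hidden * n_kv_heads * head_dim
    v = hidden * n_kv_heads * head_dim
    o = n_heads * head_dim * hidden if include_o else 0
    return q + k + v + o


def trainable_fraction(base_params: float, n_layers: int, per_layer: int) -> float:
    """Added (trainable) weights as a share of base plus added weights."""
    added = n_layers * per_layer
    return added / (base_params + added)


# Assumed Qwen3-8B-like shapes; NOT taken from the paper or the thread.
HIDDEN, N_HEADS, N_KV_HEADS, HEAD_DIM, N_LAYERS = 4096, 32, 8, 128, 36
BASE_PARAMS = 8.19e9  # assumed, rounded

qkv_only = trainable_fraction(
    BASE_PARAMS, N_LAYERS,
    attn_proj_params(HIDDEN, N_HEADS, N_KV_HEADS, HEAD_DIM, include_o=False))
qkv_and_o = trainable_fraction(
    BASE_PARAMS, N_LAYERS,
    attn_proj_params(HIDDEN, N_HEADS, N_KV_HEADS, HEAD_DIM, include_o=True))

print(f"Q,K,V only: {qkv_only:.1%}   Q,K,V,O: {qkv_and_o:.1%}")
```

Under these assumed numbers the four-projection count comes out close to 16% and the three-projection count close to 10%, which would be consistent with the point that the stated figure only works when the O projection is included.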