| HN Mirror



	by sleepyeldrazi 35 days ago
	From a quick, shallow read of the paper, it looks very feasible (with a little tinkering) to adapt this to qwen3.6 27B. The process looks somewhat similar to training a LoRA, or in a way to distilling your own model: a mini model learns to imitate it, and you glue the two together. I might bite the bullet and rent a GPU to do it for 3.6 27B, as this would solve a lot of my problems.


sleepyeldrazi 35 days ago

Scratch that, I don't have that kind of money, and 3.5's architecture diverges a bit more from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.


Boranbruh 35 days ago

There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?


sleepyeldrazi 35 days ago

My plan is to first validate whether it even works, using qwen3.5 0.8B on my 3090 (it has the same architecture as qwen3.6 27B, just scaled down a bit). If it does, I'll put up a git repo about the process for anyone who wants to use my approach, while I try to convince my uni to lend me H100s for a day.


sleepyeldrazi 35 days ago

If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .

The hard part was that the original Orthrus works with transformers, but 3.5 (and 3.6) is hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with the GatedDeltaNet, and dry runs are promising, but only a full training run will reveal if it works. More information is in the repo and on the site under the "What is this all about?" button.

Note: I may restart it or try different configs at different points. If the site is down, there is probably some sort of result/conclusion in the repo.


dot_treo 35 days ago

And it also looks like the original authors are working on qwen 3.5 too: https://github.com/chiennv2000/orthrus/issues/1#issuecomment...


dot_treo 35 days ago

I would probably treat each (3 GatedDeltaNet + 1 GatedAttention) block as one transformer block. When generating the next steps, one would therefore use the KV cache for the gated attention and skip the delta nets entirely.
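
To make the grouping idea concrete, here is a rough sketch of how the layers could be partitioned. It is only an illustration: the layer labels, the 8-layer depth, and the assumption that the attention layer comes last in each group of four are all made up for the example, not taken from the Qwen config or the Orthrus code.

```python
def layer_pattern(num_layers, period=4):
    """Hypothetical layout: every `period`-th layer is full (gated) attention,
    the others are GatedDeltaNet. This gives the 75% / 25% split."""
    return [
        "gated_attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


def group_into_blocks(pattern, period=4):
    """Bundle consecutive layers into (3 GatedDeltaNet + 1 GatedAttention) blocks."""
    return [pattern[i:i + period] for i in range(0, len(pattern), period)]


def kv_cached_layers(pattern):
    """Indices of the layers that own a KV cache (the attention layers only)."""
    return [i for i, kind in enumerate(pattern) if kind == "gated_attention"]


if __name__ == "__main__":
    pattern = layer_pattern(8)  # 8 layers is an arbitrary toy depth
    for block in group_into_blocks(pattern):
        print(block)
    print("layers with a KV cache:", kv_cached_layers(pattern))
```

In this toy layout only one layer in four carries a KV cache, which is what the draft path would reuse while the delta-net layers are skipped.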


sleepyeldrazi 35 days ago

It is actually very exciting that they are also working on 3.5. I will keep this toy project up in the meantime, since trying it out and testing things around it helps me learn a bunch.

As for the idea of treating them as a block, that was my initial plan, but the GatedDeltaNet is doing most of the work in 3.5. Bundling them together would hurt acceptance rates drastically, potentially leaving the speedup not much bigger than, or even smaller than, native MTP.
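
For a feel of why acceptance rate matters so much, here is a back-of-the-envelope sketch. It uses the usual simplification from speculative decoding analyses: each drafted token is accepted independently with the same probability, and drafting stops at the first rejection. That is an assumption for illustration; neither Orthrus nor native MTP necessarily behaves this way, and the rates below are made-up inputs, not measurements.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens produced per verification pass.

    Assumes i.i.d. acceptance with probability `accept_rate` for each of
    `draft_len` drafted tokens, plus the one token the verifier always
    yields: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)  # closed form divides by zero here
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only to show the trend.
    for rate in (0.3, 0.6, 0.9):
        tokens = expected_tokens_per_step(rate, draft_len=4)
        print(f"accept_rate={rate:.1f} -> {tokens:.2f} tokens per step")
```

The expected tokens per step falls off quickly as the acceptance rate drops, so a cheaper draft path that skips most of the model can end up with little net gain.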
