by spwa4 31 days ago
I don't understand. This distills a diffusion transformer out of Qwen3, and while the provably identical output is nice, a full diffusion transformer would be a lot faster still.
1 comment

A full diffusion transformer would need more forward passes (and so be slower), or produce worse output (because it can't properly account for dependencies between tokens when it generates them independently in parallel), or both. Keeping the output identical to the autoregressive baseline ensures the speedup doesn't come at the cost of quality.
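
To make the "identical output" point concrete, here is a toy sketch of one way it could work, assuming greedy decoding and a draft-then-verify loop. That scheme is my assumption, not something stated about the actual method. `toy_next_token`, `toy_draft`, and `draft_and_verify` are all made-up names; the draft guesses a block of tokens independently of each other, and the baseline model's token is always the one that gets kept.

```python
from typing import Callable, List, Tuple

Token = int


def toy_next_token(seq: List[Token]) -> Token:
    """Hypothetical stand-in for one greedy step of the autoregressive model."""
    return (seq[-1] * 3 + len(seq)) % 7


def toy_draft(seq: List[Token], block: int) -> List[Token]:
    """Hypothetical parallel draft: each position is guessed from the last
    accepted token only, so it ignores the other drafted tokens."""
    return [(seq[-1] * 3 + len(seq) + i) % 7 for i in range(block)]


def greedy_autoregressive(
    next_token: Callable[[List[Token]], Token], prompt: List[Token], n: int
) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def draft_and_verify(
    next_token: Callable[[List[Token]], Token],
    draft: Callable[[List[Token], int], List[Token]],
    prompt: List[Token],
    n: int,
    block: int = 4,
) -> Tuple[List[Token], int]:
    """Generate n tokens; returns (tokens, number of verification passes).

    Assumes draft() returns at least one token. In a real model the whole
    block would be checked in a single forward pass; here that is only
    simulated by counting one pass per block.
    """
    seq = list(prompt)
    target_len = len(prompt) + n
    passes = 0
    while len(seq) < target_len:
        proposal = draft(seq, block)
        passes += 1
        for guess in proposal:
            correct = next_token(seq)  # what the baseline would emit here
            seq.append(correct)        # always keep the baseline's token
            if guess != correct or len(seq) == target_len:
                break                  # discard the rest of the draft
    return seq[len(prompt):], passes


prompt = [1, 2]
baseline = greedy_autoregressive(toy_next_token, prompt, 20)
fast, passes = draft_and_verify(toy_next_token, toy_draft, prompt, 20)
print(fast == baseline, passes)
```

Because only the baseline's token is ever appended, the result matches the baseline by construction; the draft only affects how many passes are needed. A draft that handles token dependencies badly gets rejected more often, which costs speed but not quality.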