by yorwba 31 days ago
A full diffusion transformer would need more forward passes (thus being slower) or produce worse output (because it can't properly account for dependencies between tokens when generating them independently in parallel), or both. Keeping the output identical to the autoregressive baseline ensures the speedup doesn't come at the cost of quality degradation.
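To make the "identical output" point concrete, here is a minimal toy sketch of a generic draft-and-verify loop. It is not the method discussed in the thread: `target_next` and `draft_block` are hypothetical stand-ins for a real autoregressive model and a parallel drafter. The idea it illustrates is that every token kept is the one the autoregressive baseline would have picked, so a wrong draft only costs speed, never quality.

```python
from typing import List, Tuple

VOCAB = 7  # toy vocabulary size (assumption)


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the baseline model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Hypothetical cheap drafter: guesses k tokens at once, with deliberate errors."""
    ctx, guesses = list(prefix), []
    for _ in range(k):
        g = target_next(ctx)
        if len(ctx) % 3 == 2:      # inject a wrong guess at some positions
            g = (g + 1) % VOCAB
        guesses.append(g)
        ctx.append(g)
    return guesses


def greedy_decode(prefix: List[int], n: int) -> List[int]:
    """Plain autoregressive baseline: one model call per generated token."""
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out


def draft_and_verify(prefix: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n tokens; returns (sequence, number of verification rounds)."""
    out, goal, rounds = list(prefix), len(prefix) + n, 0
    while len(out) < goal:
        draft = draft_block(out, k)
        rounds += 1                # in a real model: one batched forward pass
        for g in draft:
            t = target_next(out)   # what the baseline would emit here
            out.append(t)          # always keep the baseline's token
            if t != g or len(out) >= goal:
                break              # draft diverged (or we are done): redraft
    return out, rounds


tokens, rounds = draft_and_verify([1, 2], 20)
```

Because the loop only ever appends `target_next(out)`, the result matches `greedy_decode` exactly; the drafter's accuracy only affects how many rounds are needed.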