by stefanbaumann 885 days ago
The "input image" is just the noisy sample from the previous timestep, yes.
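To make that concrete, here is a minimal sketch of a sampling loop, assuming a denoiser-style interface `(x, sigma) -> estimate of the clean sample` and a plain Euler sampler. `toy_denoiser`, `euler_sample`, and the noise schedule are hypothetical stand-ins for illustration, not the paper's code. The point is only that the backbone's input at each step is the noisy sample left over from the previous step.

```python
import random

def toy_denoiser(x, sigma, sigma_data=1.0):
    """Hypothetical stand-in for the trained backbone: the ideal denoiser
    for zero-mean Gaussian data with standard deviation sigma_data."""
    scale = sigma_data**2 / (sigma_data**2 + sigma**2)
    return [scale * v for v in x]

def euler_sample(denoiser, sigmas, size, seed=0):
    """Euler sampler over a decreasing noise schedule (assumed, for illustration)."""
    rng = random.Random(seed)
    # Start from pure noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(size)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # The backbone's "input image" is x, the noisy sample from the previous step.
        denoised = denoiser(x, sigma)
        # Direction pointing from the denoised estimate towards the current sample.
        d = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        # Step down to the next (lower) noise level.
        x = [xi + di * (sigma_next - sigma) for xi, di in zip(x, d)]
    return x

sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]  # hypothetical schedule ending at zero noise
sample = euler_sample(toy_denoiser, sigmas, size=4)
```

In a real model `x` would be an image tensor and `toy_denoiser` would be the trained network, but the loop has the same shape.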

The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.

Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.


Thank you. I'm not used to reading this kind of research paper, but I think I've got the gist of it now.

Can this architecture be used to distill models that need fewer timesteps, like LCMs or SDXL Turbo?

Neither Latent Consistency Models nor Adversarial Diffusion Distillation (the method behind SDXL Turbo) depends on any specific properties of the backbone. Hourglass Diffusion Transformers are just a new kind of backbone that can be used in the same way as the diffusion U-Nets in Stable Diffusion (XL), so these methods should be applicable to them as well.
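As a rough illustration of what "does not depend on the backbone" means in practice, here is a minimal sketch of a few-step sampler in the style of multistep consistency sampling, assuming the distilled model is exposed as a callable `(x, sigma) -> estimate of the clean sample`. Both backbones below are hypothetical toy placeholders, not real U-Net or HDiT implementations, and this shows only the sampling side, not the distillation training itself.

```python
import random

def unet_like_backbone(x, sigma):
    """Hypothetical placeholder for a distilled U-Net-style student."""
    return [v / (1.0 + sigma**2) for v in x]

def hdit_like_backbone(x, sigma):
    """Hypothetical placeholder for a distilled HDiT-style student."""
    return [v / (1.0 + sigma**2) ** 0.5 for v in x]

def few_step_sample(backbone, sigmas, size, seed=0):
    """Few-step sampling that treats the backbone as a black box."""
    rng = random.Random(seed)
    # Start from pure noise at the highest noise level.
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(size)]
    # One call maps the noisy sample straight to an estimate of the clean sample.
    x0 = backbone(x, sigmas[0])
    for sigma in sigmas[1:]:
        # Re-noise the estimate to a lower noise level, then map it back in one call.
        x = [v + sigma * rng.gauss(0.0, 1.0) for v in x0]
        x0 = backbone(x, sigma)
    return x0

sigmas = [10.0, 2.0, 0.5]  # hypothetical noise levels, one model call each
for backbone in (unet_like_backbone, hdit_like_backbone):
    result = few_step_sample(backbone, sigmas, size=3)
```

The sampler never inspects the network's internals, only its input and output, which is why swapping one backbone for another requires no change to the loop.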