by stefanbaumann
885 days ago
The "input image" is just the noisy sample from the previous timestep, yes. The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ. Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
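To make the "noisy sample from the previous timestep" point concrete, here is a minimal toy sketch of an iterative sampling loop. It is not the paper's code: `toy_denoiser`, the `class_label` argument, the noise schedule, and the plain Euler update are all assumptions chosen for illustration, and the "network" is a stand-in that always predicts a constant image.

```python
import numpy as np

def toy_denoiser(x_noisy, sigma, class_label=None):
    # Hypothetical stand-in for a trained denoising network.
    # A real model would use x_noisy, sigma, and the conditioning signal;
    # this toy ignores all three and always predicts a flat gray image.
    return np.full_like(x_noisy, 0.5)

def sample(shape, sigmas, class_label=None, seed=0):
    rng = np.random.default_rng(seed)
    # Start from pure noise at the highest noise level.
    x = rng.standard_normal(shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # The model's "input image" is simply x, the output of the previous step.
        denoised = toy_denoiser(x, sigma, class_label)
        # Euler step from noise level sigma to sigma_next (assumed sampler).
        direction = (x - denoised) / sigma
        x = x + (sigma_next - sigma) * direction
    return x

# Assumed, hand-picked noise schedule ending at zero noise.
sigmas = np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.0])
image = sample((4, 4), sigmas)
```

Because the toy denoiser always predicts the same image, the loop converges to that image once the noise level reaches zero; with a real network, each step would instead refine the previous step's noisy sample.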
Can this architecture be used to distill models that need fewer timesteps, like LCMs or SDXL Turbo?