by sorenjan
885 days ago
I was referring to the input image in the diagram: what is it, and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is what guides the process toward the final image if it's not text-to-image?
The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.
Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.
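
To make the question above concrete, here is a minimal sketch of how class-conditional diffusion sampling generally works: start from pure Gaussian noise at the output resolution and repeatedly denoise it, with the class label as the only thing steering the result. This is not the paper's code. `toy_denoiser`, `CLASS_MEANS`, `DATA_STD`, and the noise schedule are all hypothetical stand-ins; a real model would replace the closed-form denoiser with the trained network plus its conditioning network.

```python
import numpy as np

# Hypothetical toy "dataset": every image of class c is a flat image with
# value CLASS_MEANS[c] plus Gaussian pixel noise of std DATA_STD.
CLASS_MEANS = {0: -0.5, 1: 0.5}
DATA_STD = 0.1


def toy_denoiser(x, sigma, class_label):
    """Stand-in for a trained class-conditional denoiser.

    For the toy Gaussian data above, the ideal estimate of the clean image
    given a noisy input x at noise level sigma has this closed form.
    """
    mu = CLASS_MEANS[class_label]
    gain = DATA_STD**2 / (DATA_STD**2 + sigma**2)
    return mu + gain * (x - mu)


def sample(class_label, shape=(3, 256, 256), n_steps=50,
           sigma_max=80.0, sigma_min=0.01, seed=0):
    """Generate one image by Euler-stepping from pure noise down to sigma = 0."""
    rng = np.random.default_rng(seed)
    # Decreasing noise levels, ending exactly at zero noise.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    # The "input image": nothing but Gaussian noise at the output resolution.
    x = sigma_max * rng.standard_normal(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoiser(x, sigma, class_label)  # label guides this step
        direction = (x - denoised) / sigma
        x = x + (sigma_next - sigma) * direction
    return x


img_class0 = sample(class_label=0)
img_class1 = sample(class_label=1)
```

The same starting noise ends up as a different image depending on the label, which is the sense in which the class "guides" the process. In the fully unconditional case the denoiser simply takes no label, and the output is determined by the noise alone.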