I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are fed into the image-to-image network that is trying to reduce the noise. In other words, I am missing how the process is 'conditioned on' the text. (I lack the same understanding for conditional GANs as well.)
If the answer is just that the textual embeddings are also fed to the network as ordinary inputs, then I already understand that.