by geonic
1483 days ago
Can anybody give me a short, high-level explanation of how the model achieves these results? I'm especially interested in the image synthesis, not the language parsing. For example, what kind of source images are used for the snake made of corn[0]? It's baffling to me how the corn is mapped to the snake body.

[0] https://gweb-research-imagen.appspot.com/main_gallery_images...
So, roughly: text -> text representation (an embedding) -> most likely start from noise in image space -> iteratively reduce the noise N times, guided by the text representation -> upsample the result.
Something like that, please correct anything I'm missing.
Re: the snake/corn question, it is mapping the "concept" of corn onto the concept of a body, with both represented as intermediate learned vectors, rather than pasting together particular source images.
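
To make the pipeline above concrete, here is a toy sketch in plain Python. Everything in it is a made-up stand-in, not the real model: `encode_text`, `toy_denoiser`, `sample`, and `upsample` are hypothetical names, the "embedding" is just seeded random numbers, the "denoiser" is a hand-written rule instead of a trained network, and nearest-neighbor upsampling stands in for whatever super-resolution stage the actual system uses. It only shows the shape of the loop: start from noise, repeatedly subtract predicted noise while conditioning on the text, then upsample.

```python
import random

def encode_text(prompt, dim=8):
    # Hypothetical stand-in for a text encoder: a deterministic
    # pseudo-embedding seeded by the prompt string.
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def toy_denoiser(image, text_emb):
    # Hypothetical stand-in for a trained network. It "predicts the noise"
    # as the gap between the current image and a target pattern derived
    # from the text embedding. A real model learns this from data.
    size = len(image)
    return [
        [image[r][c] - text_emb[(r * size + c) % len(text_emb)]
         for c in range(size)]
        for r in range(size)
    ]

def sample(prompt, size=4, steps=10, seed=0):
    rng = random.Random(seed)
    text_emb = encode_text(prompt)
    # Start from pure Gaussian noise in image space.
    image = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    # Iteratively remove a growing fraction of the predicted noise.
    for step in range(steps, 0, -1):
        noise = toy_denoiser(image, text_emb)
        image = [
            [image[r][c] - noise[r][c] / step for c in range(size)]
            for r in range(size)
        ]
    return image

def upsample(image, factor=2):
    # Nearest-neighbor upsampling: repeat each pixel and each row.
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

low_res = sample("a snake made of corn")
high_res = upsample(low_res, factor=2)
print(len(low_res), "x", len(low_res[0]), "->", len(high_res), "x", len(high_res[0]))
```

Note that in this toy the starting noise only affects the path, not the destination: different seeds end at the same text-determined pattern. Real samplers generally give different images for different seeds.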