by dave_sullivan
1483 days ago
Well, first they parse the text into a high-level vector representation. Then they take images, add noise, and train a model to remove that noise, so it can start from a noisy image and produce a clear one. Then they train a model to map from the text representation to the noisy image representation for the corresponding image. Then they upsample twice to get to a good resolution.

So: text -> text representation -> most likely noised image space -> iteratively reduce noise N times -> upsample result

Something like that, please correct anything I'm missing.

Re: the snake corn question, it is mapping the "concept" of corn to the concept of a body, as represented by intermediate learned vector representations.
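For what it's worth, here is a toy Python sketch of that pipeline, just to make the data flow concrete. Every name in it (`encode_text`, `text_to_image_repr`, `denoise_step`, `upsample2x`, `generate`) is made up for illustration, and nothing is actually trained: the "denoiser" cheats by nudging pixels toward the conditioning grid, where a real one would be a network predicting the noise. Treat it as a sketch of the stages, not of how any real system implements them.

```python
import random

def encode_text(prompt, dim=8):
    # Hypothetical stand-in for a text encoder: bucket words into a small vector.
    vec = [0.0] * dim
    for word in prompt.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec

def text_to_image_repr(text_vec, size=4):
    # Hypothetical stand-in for the learned text -> image-representation mapping:
    # tile the text vector into a size x size grid.
    n = len(text_vec)
    return [[text_vec[(r * size + c) % n] for c in range(size)]
            for r in range(size)]

def denoise_step(noisy, cond, rate=0.5):
    # Stand-in for a trained denoiser (assumption: a real one is a neural net).
    # Moves each pixel a fraction of the way toward the conditioning grid.
    return [[p + rate * (t - p) for p, t in zip(prow, trow)]
            for prow, trow in zip(noisy, cond)]

def upsample2x(img):
    # Nearest-neighbour upsampling: duplicate every pixel and every row.
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out

def generate(prompt, steps=10, seed=0):
    rng = random.Random(seed)
    cond = text_to_image_repr(encode_text(prompt))   # text -> representation
    size = len(cond)
    # Start from pure Gaussian noise.
    img = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    for _ in range(steps):                           # reduce noise N times
        img = denoise_step(img, cond)
    return upsample2x(upsample2x(img))               # upsample twice

image = generate("a snake made of corn")
print(len(image), "x", len(image[0]))  # prints: 16 x 16
```

The 4x4 grid becomes 16x16 after the two upsampling passes, which mirrors the "upsample twice" step in spirit only; real systems use learned upsamplers rather than pixel duplication.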