Looking at the result (figure 4), the model looks more like a good classifier than an image retriever: the generated images have the same subject as the original but are otherwise very different, even in the colors of the subjects.
Exactly. The whole image generation bit, at this stage, is noise. But you could easily describe what someone is looking at, and even that is very interesting. Perhaps this will scale up to give high-quality, accurate outputs.