by samvher
1533 days ago
On (2), so this part is where I wonder: no-one has "expressive painting of a man shining rays of justice and transparency on a blue bird twitter logo" as their twitter bio. So are the "happy sisyphus" images generated from "happy sisyphus children's style", or are they generated from something more like "a person carries a large ball in a mellow image in the style of a pixar cartoon"? To me there is a huge difference between these things: how much of the context is inferred from the bio, and how much from what's provided in the prompt? (Does DALL-E 2 know about the story of Sisyphus or is that part filled in?)

So I reckon that with "happy sisyphus" it breaks the prompt apart into discrete vectors as a first disambiguation step, which in this case results in two distinct queries.
Happy returns all kinds of image results.
Sisyphus returns the same kind of image results over and over: a man rolling a boulder up a hill. Thus it can learn the concept of "sisyphus" on the fly, since the labels it gets back would look like:
man 95%, boulder 90%, hill 80%, etc.
over a range of images.
So it must be Man+Boulder+Hill. That's its scene cue. That's what CLIP doodles initially. That's the "find me similar images" step.
Happy is the style cue.
That's how "happy sisyphus" expanded into "a person carries a large ball in a mellow image in the style of a pixar cartoon".
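To make that guess concrete, here is a toy Python sketch of the "consistent labels mean scene cue, scattered labels mean style cue" idea. It is only an illustration of my speculation, not how DALL-E 2 or CLIP actually works: the label confidences are invented, and the 0.7 threshold and the names `consistent_labels` and `classify_word` are hypothetical.

```python
from statistics import mean

# Invented per-image label confidences for each query word (illustration only).
sisyphus_results = [
    {"man": 0.96, "boulder": 0.92, "hill": 0.83},
    {"man": 0.94, "boulder": 0.88, "hill": 0.77},
    {"man": 0.95, "boulder": 0.90, "hill": 0.80},
]
happy_results = [
    {"dog": 0.91, "grass": 0.60},
    {"child": 0.88, "balloon": 0.75},
    {"face": 0.93, "beach": 0.52},
]


def consistent_labels(results, threshold=0.7):
    """Return labels whose mean confidence over all images is >= threshold.

    A label that is missing from an image counts as 0.0 for that image,
    so only labels that show up across the whole range of images survive.
    """
    labels = {label for image in results for label in image}
    means = {
        label: mean(image.get(label, 0.0) for image in results)
        for label in labels
    }
    return {label: m for label, m in means.items() if m >= threshold}


def classify_word(results, threshold=0.7):
    """Guess whether a query word is a scene cue or a style cue."""
    scene = consistent_labels(results, threshold)
    if scene:
        return ("scene", sorted(scene))
    return ("style", [])


print(classify_word(sisyphus_results))
print(classify_word(happy_results))
```

With this made-up data, "sisyphus" keeps man, boulder and hill because they recur in every image, while "happy" keeps nothing because no label repeats, so it falls through to being treated as a style cue.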
Why specifically the Pixar style? One of several variations it tried, selected by a human.
The thing we don't know is whether the Pixar-styled image is composited from existing images in its training set. In other words, whether the output can be traced back to the specific source images.
That character looks familiar tho. I think it is plagiarizing.
Here is another observation: the boulder is not round, it reminds me of one of the Platonic solids. I don't think that's a coincidence, heh.