by siglesias
1413 days ago
I discovered something like this recently when I tried the prompt "man throwing his smartphone into a river," and for the life of me I could not get DALL-E to render the phone separated from the hand (I tried "like a boomerang," "tossing," "into an ocean," "like a baseball," etc.). And then it occurred to me that in the training data, there are virtually no pictures of a person and a phone where the phone is separated! So DALL-E might have thought that the phone was just an appendage of the body, the way the hand is (which, what does this say about society!). I might as well have asked DALL-E to render someone throwing their elbow into a river. Another interesting case is animal-on-animal interactions. A prompt like "small french bulldog confronts a deer in the woods" often yields weird things like the bulldog donning antlers! As far as the algorithm is concerned, it sees a bulldog, ticking the box for "bulldog," and it sees the antlers, ticking the box for "deer." The semantics don't seem to be fully formed.
> The semantics don't seem to be fully formed.
Yes, not so much 'formed' as 'formed and then scrambled'. This is due to unCLIP, as clearly documented in the DALL-E 2 paper, and even clearer when you contrast it with the GLIDE paper (which DALL-E 2 is based on) or Imagen or Parti. Injecting the contrastive embedding to override a regular embedding trades away semantic accuracy in exchange for visual creativity/diversity, so if you insist on exact semantics, DALL-E 2 samples are only a lower bound on what the model can do. It does a reasonable job, better than many systems up until like last year, but not as good as it could if you weren't forced to use unCLIP. You're only seeing what it can do after being scrambled through unCLIP. (This is why Imagen or Parti can accurately pull off what feels like absurdly complex descriptions - seriously, look at the examples in their papers! - but people also tend to describe them as 'bland'.)