| HN Mirror

	by siglesias 1458 days ago
	I discovered something like this recently when I tried the prompt "man throwing his smartphone into a river," and for the life of me I could not get DALL-E to render the phone separated from the hand (I tried "like a boomerang," "tossing," "into an ocean," "like a baseball," etc). And then it occurred to me that in the training data, there are virtually no pictures of a person and a phone where the phone is separated! So DALL-E might have thought that the phone was just an appendage to the body, the way the hand is (and what does that say about society!). I might as well have asked DALL-E to render someone throwing their elbow into a river. Another interesting case is animal-on-animal interactions. A prompt like "small french bulldog confronts a deer in the woods" often yields weird things like the bulldog donning antlers! As far as the algorithm is concerned, it sees a bulldog, ticking the box for "bulldog," and it sees the antlers, ticking the box for "deer." The semantics don't seem to be fully formed.

1 comments

gwern 1458 days ago

I dunno man, I punched that exact prompt ("man throwing his smartphone into a river") in DALL-E 2 just now, and in 2/4 samples, the smartphone is clearly separate from the hand: labs.openai.com/s/uIldzs2efWWnm3i9XjsHI7or labs.openai.com/s/jSk4qhAxSiL7QJo7zeGp6m9f

> The semantics don't seem to be fully formed.

Yes, not so much 'formed' as 'formed and then scrambled'. This is due to unCLIP, as clearly documented in the DALL-E 2 paper, and even clearer when you contrast it with the GLIDE paper (which DALL-E 2 is based on) or Imagen or Parti. Injecting the contrastive embedding to override a regular embedding trades off visual creativity/diversity for the semantics, so if you insist on exact semantics, DALL-E 2 samples are only a lower bound on what the model can do. It does a reasonable job, better than many systems up until like last year, but not as good as it could if you weren't forced to use unCLIP. You're only seeing what it can do after being scrambled through unCLIP. (This is why Imagen or Parti can accurately pull off what feels like absurdly complex descriptions - seriously, look at the examples in their papers! - but people also tend to describe them as 'bland'.)

visarga 1458 days ago

If you want multiple objects, each with individual attributes, the unCLIP model still has to make a single embedding vector representation. But the single vector is too small to contain an ever more detailed scene description. That's why it has failure modes like assigning the wrong colour to cubes and not being able to spell text.

On the other hand, the previous approach - autoregressive generation - allows full access to the prompt through the attention mechanism.

For example, Imagen encodes text into a sequence of embeddings.

> Imagen comprises a frozen T5-XXL [52] encoder to map input text into a sequence of embeddings and a 64×64 image diffusion model, followed by two super-resolution diffusion models
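
To make the single-vector vs. sequence point concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the 2-d embeddings in `EMB` are made up, and mean pooling is only a crude stand-in for "compress the prompt into one vector" (it is not how CLIP or unCLIP actually computes its embedding). It shows two prompts with swapped colours collapsing to the same pooled vector while staying distinct as token sequences, plus a single scaled dot-product attention query that can still pick out an individual token.

```python
import math

# Hypothetical 2-d toy embeddings (made-up values, not from any real model).
EMB = {
    "red":    [1.0, 0.0],
    "blue":   [0.0, 1.0],
    "cube":   [1.0, 1.0],
    "sphere": [-1.0, 1.0],
}

def encode_sequence(prompt):
    """One embedding per token, order preserved (sequence-style conditioning)."""
    return [EMB[word] for word in prompt.split()]

def pool_to_single_vector(seq):
    """Mean-pool the sequence into one fixed-size vector.

    A crude stand-in for a single-embedding bottleneck, used only to show
    that an order-insensitive compression can lose attribute binding.
    """
    n = len(seq)
    dim = len(seq[0])
    return [sum(vec[i] for vec in seq) / n for i in range(dim)]

def attend(query, seq):
    """Scaled dot-product attention of one query over the tokens; returns weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in seq]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

a = encode_sequence("red cube blue sphere")
b = encode_sequence("blue cube red sphere")

# The pooled vectors are identical, so the colour-to-object binding is gone.
same_pooled = pool_to_single_vector(a) == pool_to_single_vector(b)

# The sequences differ, so a model attending over them can tell the prompts apart.
same_sequence = a == b

# A query resembling "cube" puts its largest weight on the "cube" token.
weights = attend(EMB["cube"], a)

print(same_pooled, same_sequence, weights.index(max(weights)))
```

A real learned embedding is far higher-dimensional and not a plain average, so it keeps much more than this toy does; the sketch only illustrates why a fixed-size vector becomes a bottleneck as the scene description grows, while per-token attention does not have to squeeze everything through one slot.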