The tweet is in response to a preliminary paper [1] [2] studying text found in images generated by prompts such as "Two whales talking about food, with subtitles." DALL-E doesn't generate meaningful text strings in the images, but if you feed the gibberish text it produces -- "Wa ch zod ahaakes rea." -- back into the system as a prompt, you get semantically meaningful images, e.g., pictures of fish and shrimp.
I think the tweeter is being a bit too pedantic. Personally, after seeing this paper I spent some time thinking about embeddings, manifolds, the structure of language, scientific naming, and what the decodings of points near the centers of clusters in embedding space look like (archetypes). I think building networks and asking them to explain themselves using their own capabilities is a wonderful idea that will turn out to be a fruitful area of research in its own right.
If DALL-E had a choice to output "Command not understood", maybe we wouldn't be discussing this.
Like those AIs that guess what you draw, and recognize random doodling as "clouds", DALL-E is probably using the least unlikely route. That a gibberish word is drawn as a bird is maybe because it was "bird (2%), goat (1%), radish (1%)".
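To make the "least unlikely route" idea concrete, here is a minimal sketch. It assumes a classifier-style output layer with a softmax over labels, which is an illustration only and not a description of how DALL-E works internally; the labels and scores are made up. The point is that an argmax over a softmax always names a winner, even when the winner holds only about 2% of the probability mass.

```python
import math

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(labels, logits):
    """Return the most probable label and its probability; never abstains."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Hypothetical 100-way output: "bird" scores slightly above 99 other labels.
labels = ["bird", "goat", "radish"] + [f"other_{i}" for i in range(97)]
logits = [0.7] + [0.0] * 99

label, prob = top_label(labels, logits)
print(label, round(prob, 3))  # "bird" wins with a probability of about 0.02
```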
That's extremely optimistic. When faced with gibberish, the "confidences" are routinely 90%+, just as with "meaningful" input.
It's almost as if it's an illusion designed to fool us, the users. By only providing inputs that are meaningful to us, we come to the foolish idea that it understands these inputs.
This is a good point. The fact that DALL-E will try to render something, no matter how meaningless the input, is a trait it has in common with many neural networks. If you want to use them for actual work, they should be able to fail rather than freestyle.
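One common way to let a model "fail rather than freestyle" is a reject option: return nothing when the top probability falls below a threshold. The sketch below is a hypothetical helper, not part of any real DALL-E API, and the threshold value is arbitrary. It also only helps if the confidences are meaningful; if gibberish routinely scores 90%+, as claimed upthread, a threshold like this will not catch it.

```python
def classify_or_reject(labels, probs, threshold=0.9):
    """Return the top label, or None when the model is not confident enough.

    `probs` is assumed to be a probability distribution over `labels`.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return None  # explicit failure instead of a low-confidence guess
    return labels[best]

# Hypothetical outputs for a clear input and an ambiguous one.
print(classify_or_reject(["bird", "goat"], [0.95, 0.05]))  # confident: a label
print(classify_or_reject(["bird", "goat"], [0.51, 0.49]))  # ambiguous: None
```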
Especially since his results confirm most of what the original thread claimed. A couple of the inputs did not reliably replicate, but "for the most part, they're not true" seems straightforwardly false. He even seems to deliberately ignore this sometimes, such as when he says "I don't see any bugs" when there is very obviously a bug in the beak of all but two or three of the birds.
When I zoomed in, I felt only four in ten birds clearly had anything in their beaks, and in each case it looked like vegetable matter. In the original set, only one clearly has an insect in its beak.
Not really; afterwards he says he was mostly trying to inject some humility. He really doesn't think this is measuring anything of interest. For the birds result in particular, see https://twitter.com/BarneyFlames/status/1531736708903051265.
If I read what that tweet says properly, the system ended up outputting things that were almost scientific nomenclature for the general class of items it was being asked to draw. There are probably many examples of "bird is an instance of class X" in the text but they are not consistent, and the resulting token vector is a point near the center of "birdspace".
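The "point near the center of birdspace" intuition can be sketched with toy vectors. Everything here is hypothetical: the 2-D embeddings, the words, and the gibberish token `zzgib` are invented for illustration and are not taken from any real model. The sketch averages a few bird embeddings and checks which candidate token lies closest to that centroid by cosine similarity.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vocab):
    """Name of the vocabulary entry most similar to the query vector."""
    return max(vocab, key=lambda word: cosine(query, vocab[word]))

# Made-up 2-D embeddings: three birds, plus two candidate tokens.
birds = {"sparrow": [0.9, 0.1], "finch": [0.8, 0.2], "heron": [0.7, 0.3]}
candidates = {"radish": [0.1, 0.9], "zzgib": [0.82, 0.19]}

center = centroid(list(birds.values()))
print(nearest(center, candidates))  # the gibberish token sits nearest the bird centroid
```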
> asking [neural networks] to explain themselves using their own capabilities
Exactly. This could be profound. I'm looking forward to further work here. Sure, the examples here are daft, but developing this approach could be like understanding a talking lion [0], only this time it's a lion of our making.
I think it’s more likely we can train two neural networks: one to make the decision, and one to take the same inputs (or the same inputs plus the output from the first one) and generate plausible language to explain the first. This seems to correspond to what we dimwits call consciousness, and frankly I doubt one system can accurately explain its own mechanism. People surely can’t.
It’s a fruitful area of research for sure, but there is a huge gap between “it invented pig Latin” and “it invented Esperanto/Lojban”. Referring to the first as inventing a language is very misleading.