by recuter 1534 days ago
I don't want to be dismissive of Dall-E itself or its authors. I only object to the implication that this changes everything, or that it is much more than it really is.

https://twitter.com/nickcammarata/status/1512123067803344899...

Prompt: "expressive painting of a man shining rays of justice and transparency on a blue bird twitter logo"

You have to break the concepts apart (which is one of the things Dall-E improved on).

As such: "expressive blue bird"

In Google image search, with the type set to clipart, I even get pill tags to further narrow it down to illustrations for animal paintings and so forth. Google's classifier knows the concept of a "blue bird", and expressionism too.

https://www.google.com/search?q=expressive+blue+bird&tbm=isc...

The same goes for "ray of light". In fact, the top results I get there are PNGs of sun beams on a transparent background, which is perfect.

Neither the birds nor the rays of light in the pictures it produced are truly its own creations but lifted from bits of pictures in its training set. I bet you could find the exact bird from the second row online in many places for example. It just won't be blue or stylized.

Composite those things together manually and add a style transfer you'll get similar results to DALL-E as that is what it is doing more or less.

3 comments

> Composite those things together manually and add a style transfer you'll get similar results to DALL-E as that is what it is doing more or less.

If you try actually doing this it will be trivial to see that this assertion is incorrect.

1. The way in which the elements of the images are integrated together is deeper than the level of style. For instance, see the image in the top row, second column: it has integrated the blue bird wings onto the man, not only simply grafting them on, but giving the appearance of their being draped on like a cloak, partly behind and partly in front of him (+ it's consistent with the man's posture and the rays of light to evoke a certain coherent cultural idea/image). You might be able to integrate multiple images (of man, bird, rays etc.) together and style transfer to arrive at a poor approximation of this—but even then, the decision to place the elements together in such a way would require creativity on your part.

2. The one example set of trial images (generated from the phrase "expressive painting of a man shining rays of justice and transparency on a blue bird twitter logo") is one of the easiest among the full group to pick its various elements apart; if you try this thought experiment with the others in the thread, you'll see this idea is by far insufficient.

Good, finally. Yes, exactly - this is the most interesting aspect of the whole thing.

> the decision to place the elements together in such a way would require creativity on your part

I strongly suspect that's because it found similar compositions in its training set. So what exactly is going on here is fascinating.

Did it learn compositing? Is that why the image output is now much more stable? Or is it merely finding similar artwork and competently recreating/mimicking existing compositions from different building blocks? So now we can not only transfer styles but also transfer compositions. That could be the beginning of something useful. Instead of a text prompt I'd give it my crappy doodle and it will respond with an improved/different one that is comparable (also a great way to steal tho).

And of course I picked the one that is easiest to tease apart where it is most evident so people will see what I mean.

> if you try this thought experiment with the others in the thread, you'll see this idea is by far insufficient

That depends on your imagination and your artistic eye I guess. Even if somebody could do that they certainly couldn't make you believe them. That's the accomplishment.

Neither one of us can prove it one way or the other so long as the model is a black box, and certainly not so long as we don't have direct access to OpenAI's model but just to curated examples.

On (2), so this part is where I wonder: no-one has "expressive painting of a man shining rays of justice and transparency on a blue bird twitter logo" as their twitter bio. So are the "happy sisyphus" images generated from "happy sisyphus children's style", or are they generated from something more like "a person carries a large ball in a mellow image in the style of a pixar cartoon"? To me there is a huge difference between these things: how much of the context is inferred from the bio, and how much from what's provided in the prompt? (Does DALL-E 2 know about the story of Sisyphus or is that part filled in?)

In the video accompanying the paper they gave the example of "tree bark". Do we mean the bark of a tree or a dog barking at a tree?

So I reckon that with "happy sisyphus" it breaks the prompt apart into discrete vectors as a first disambiguation step, in this case resulting in two distinct queries.

Happy returns all kinds of image results.

Sisyphus returns the same kind of image results over and over.

A man rolling a boulder up a hill. Thus it can learn the concept of "sisyphus" on the fly as it would return:

man 95% boulder 90% hill 80% etc

Over a range of images.

So it must be Man+Boulder+Hill. That's its scene cue. That's what CLIP doodles initially. That's the "find me similar images" step.

Happy is the style cue.

That's how "happy sisyphus" expanded into "a person carries a large ball in a mellow image in the style of a pixar cartoon"
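A minimal sketch of the aggregation step guessed at above, to make the idea concrete. It only illustrates the speculation in this comment and is not a description of how CLIP or DALL-E actually works; the function name `scene_cue`, the 0.75 threshold and every score below are made up for illustration.

```python
from collections import defaultdict


def scene_cue(per_image_tags, threshold=0.75):
    """Average each tag's confidence over all images and keep the
    tags whose mean is at or above the threshold, strongest first."""
    totals = defaultdict(float)
    for tags in per_image_tags:
        for tag, score in tags.items():
            totals[tag] += score
    n = len(per_image_tags)
    # A tag missing from an image counts as 0 for that image.
    means = {tag: total / n for tag, total in totals.items()}
    kept = [tag for tag, mean in means.items() if mean >= threshold]
    return sorted(kept, key=lambda tag: -means[tag])


# Hypothetical classifier output for image results of each query.
sisyphus_results = [
    {"man": 0.95, "boulder": 0.90, "hill": 0.80},
    {"man": 0.97, "boulder": 0.88, "hill": 0.78, "sky": 0.30},
    {"man": 0.93, "boulder": 0.92, "hill": 0.82},
]
happy_results = [
    {"smile": 0.9, "child": 0.6},
    {"dog": 0.8, "sun": 0.5},
    {"balloon": 0.7, "smile": 0.4},
]

print(scene_cue(sisyphus_results))  # tags that recur with high confidence
print(scene_cue(happy_results))     # nothing recurs, so no scene cue
```

Under these made-up numbers, "sisyphus" yields a stable set of tags to use as the scene cue, while "happy" yields none, which is why the comment treats it as a style cue instead.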

Why specifically the Pixar style? One of several variations it tried, selected by a human.

The thing we don't know is whether the Pixar styled image is composited from the existing images in its training set. In other words whether this can be reversed.

That character looks familiar tho. I think it is plagiarizing.

Here is another observation: the boulder is not round, it reminds me of one of the Platonic solids. I don't think that's a coincidence, heh.

You're asserting a bunch of things about how it works that have no basis in reality. If you want to be able to comment on this stuff with any accuracy, read the research they've published.

They are generated from e.g. "happy sisyphus". My understanding is there are separate additional controls for style (though it's flexible enough you could give hints in the text, too, which, I gather, is where the "expressive" word fits in).

I think your last line is what stands out more than anything. You've just described creating something without "compositing those things together manually."

Note that in that example the "twitter bird logo" is actually expressed in 6 out of all of those images. Look for the small bird, that looks like the Twitter logo. It's there. It's doing the thing.

The prompt is actually "blue bird twitter logo".

Nothing is expressed. Find yourself a blue bird in an expressionistic style, go to Google image search and give it the URL. Click on tools -> visually similar.

Enjoy an endless supply of things to plagiarize. In the middle picture of the second row you can clearly see how several pre-existing images are sharply cut off before being re-blended.

Same thing going on here as in your other comments.

Tech like CLIP, GPT-3, DALL-E, etc. are indeed nearing the sophistication (w. caveats around outliers and harmful outputs) of Google search.

It took a lot of people to create Google search. It took precisely one training run for DALL-E 2 to create this.

edit: Removed toxic comment.

No, don't get me wrong. I think DALL-E is very interesting and a potentially useful tool and have nothing against the tool makers.

The tool wielders, however, are I think overhyping this, to say the least, and focusing on the wrong bits. It isn't sentient and it is not making art. But teasing apart how it is deriving these images might shake out serious advancements.

Fair enough, I think we are in agreement.

> Prompt: "expressive painting of a man shining rays of justice and transparency on a blue bird twitter logo"

Yeah, weak results. None of the men look anything like Elon Musk. /s