I know as much about how to get the best image outputs from text inputs as the person who designed an airport knows the best place to eat in it. The emergent properties of the system are a result of the data put into it, so I can only discuss the system itself, not what it ended up doing with the data in that system.

The models are a product of their datasets, specifically the relationship between images and prompts as learned via CLIP. CLIP puts both images and text into the same coordinate space; imagine it as just a 2D graph. It tries to ensure that any real image and its caption are each other's closest neighbors in that coordinate space.
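To make the "closest neighbor" idea concrete, here is a toy sketch in Python. The 2D vectors, the photo names, and the captions are all made up for illustration; real CLIP embeddings come from trained encoders and have far more dimensions. Cosine similarity is used as the closeness measure, which is an assumption of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 2D embeddings (invented numbers, not real CLIP output).
image_vecs = {
    "dog_photo": (0.9, 0.1),
    "beach_photo": (0.1, 0.95),
    "city_photo": (-0.8, 0.3),
}
caption_vecs = {
    "a dog on grass": (0.85, 0.2),
    "waves on a beach": (0.2, 0.9),
    "a city skyline": (-0.7, 0.4),
}

def nearest(query, candidates):
    """Name of the candidate whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# In a well-aligned space, each image's nearest caption is its own caption,
# and each caption's nearest image is its own image.
image_to_caption = {img: nearest(v, caption_vecs) for img, v in image_vecs.items()}
caption_to_image = {cap: nearest(v, image_vecs) for cap, v in caption_vecs.items()}

print(image_to_caption)
print(caption_to_image)
```

With these hand-picked numbers every image and caption pair up with each other in both directions, which is the property the training is trying to produce on real data.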

So if you want a certain image, you have to ask: "What caption would most likely, and most uniquely, be given to the image I'm imagining?"
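One way to picture "most likely and most uniquely" is as a margin: how close a caption sits to the image you want, minus how close it sits to the nearest other image. The sketch below is only an illustration of that intuition with invented 2D vectors; the `margin` score and all names are hypothetical, and this is not how CLIP or any image generator actually picks anything.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical embedding of the image you are imagining.
target = (0.9, 0.2)

# Hypothetical embeddings of other images a caption might also describe.
others = {
    "dog_on_sofa": (0.9, -0.3),
    "cat_on_grass": (0.5, 0.8),
}

# Hypothetical embeddings of candidate prompts.
captions = {
    "a dog": (1.0, 0.0),
    "a golden retriever running on grass": (0.9, 0.25),
    "an animal outdoors": (0.7, 0.6),
}

def margin(caption_vec):
    """Similarity to the target minus similarity to the closest other image."""
    to_target = cosine(caption_vec, target)
    to_closest_other = max(cosine(caption_vec, v) for v in others.values())
    return to_target - to_closest_other

# The best prompt is close to the target AND far from everything else.
best = max(captions, key=lambda name: margin(captions[name]))

for name, vec in captions.items():
    print(f"{name!r}: margin = {margin(vec):.3f}")
print("best:", best)
```

In this toy setup the generic "a dog" is close to the target but nearly as close to the other dog photo, while the specific caption wins because it fits the target well and fits the other images poorly.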

I'm sure this advice is far less helpful than what you'd find in the prompt engineering Discord channels and guides I've seen.