by nuccy
1396 days ago
Thanks for answering. Since you mentioned your work on text-to-3D, what are the ways to make the image/3D model actually photorealistic (or rather, true to reality)? Even the (presumably) hand-picked examples from Google on the linked page lack the support bars of the sunglasses, and include floating cups of wine with a base-less Eiffel Tower in the background. P.S. It seems raccoons are unimaginable (even for AI) with any sunglasses: if photo-realistic mode is selected for a raccoon, changing to "wearing a sunglasses and" makes no difference :)
|
The models are a product of their datasets, specifically of the relationship between images and their captions as learned by CLIP. CLIP maps both images and text into a shared coordinate space; imagine just a 2D graph. It is trained so that, for any real image and its caption, each will be the other's closest neighbor in that space.
So if you want a certain image, you have to ask: "What caption would most likely, and most uniquely, be given to the image I'm imagining?"
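To make the "closest neighbor in a 2D graph" picture concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the coordinates are made up by hand, the image and caption names are hypothetical, and real CLIP embeddings have hundreds of dimensions and come from trained encoders. It only shows the matching step, using cosine similarity as the closeness measure.

```python
import math

# Hand-picked 2D points standing in for CLIP embeddings (purely illustrative).
IMAGE_EMBEDDINGS = {
    "raccoon_photo": (0.9, 0.1),
    "raccoon_sunglasses_photo": (0.7, 0.7),
    "eiffel_tower_photo": (0.1, 0.9),
}
CAPTION_EMBEDDINGS = {
    "a photo of a raccoon": (0.95, 0.05),
    "a photo of a raccoon wearing sunglasses": (0.65, 0.75),
    "a photo of the Eiffel Tower": (0.05, 0.95),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two 2D vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def closest_image(caption):
    """Return the name of the image whose embedding is nearest to the caption's."""
    query = CAPTION_EMBEDDINGS[caption]
    return max(
        IMAGE_EMBEDDINGS,
        key=lambda name: cosine_similarity(query, IMAGE_EMBEDDINGS[name]),
    )


def closest_caption(image):
    """Return the caption whose embedding is nearest to the image's."""
    query = IMAGE_EMBEDDINGS[image]
    return max(
        CAPTION_EMBEDDINGS,
        key=lambda text: cosine_similarity(query, CAPTION_EMBEDDINGS[text]),
    )


if __name__ == "__main__":
    for caption in CAPTION_EMBEDDINGS:
        print(caption, "->", closest_image(caption))
```

In this toy setup each caption and its image are each other's nearest neighbor, which is the property described above. A vaguer caption would sit between several images, which is roughly why the "most uniquely given" part of the question matters.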
I'm sure this advice is way less helpful than what you'd find in the prompt engineering Discord channels and guides I've seen.