> Since guidance weights are used to control image quality and text alignment, we also report ablation results using curves that show the trade-off between CLIP and FID scores as a function of the guidance weights (see Fig. A.5a). We observe that larger variants of T5 encoder results in both better image-text alignment, and image fidelity. This emphasizes the effectiveness of large frozen text encoders for text-to-image models

I usually consider myself fairly intelligent, but I know that when I read an AI research paper I'm going to feel dumb real quick. All I managed to extract from the paper was a) there isn't a clear explanation of how it's done that was written for lay people and b) they are concerned about the quality and biases in the training sets.

Having thought about the problem of "building" an artificial means to visualize from thought, I have a very high level (dumb) view of this. Some human minds are capable of generating synthetic images from certain terms. If I say "visualize a GREEN apple sitting on a picnic table with a checkerboard table cloth", many people will create an image that approximately matches the query. They probably also see a red and white checkerboard cloth because that's what most people have trained their models on in the past. By leaving that part out of the query we can "see" biases "in the wild".

Of course there are people that don't do generative in-mind imagery, but almost all of us do build some type of model in real time from our sensor inputs. That visual model is being continuously updated and is what is perceived by the mind "as being seen". Or, as the Gorillaz put it:

  … For me I say God, y'all can see me now
  'Cos you don't see with your eye
  You perceive with your mind
  That's the end of it…

To generatively produce strongly accurate imagery from text, a system needs enough reference material in the document collection. It needs to have sampled a lot of images of corn and snakes. It needs to be able to do image segmentation and probably perspective estimation. It needs a lot of semantic representations (optimized query of words) of what is being seen in a given image, across multiple "viewing models", even from humans (who also created/curated the collections). It needs to be able to "know" what corn looks like, even from the perspective of another model. It needs to know what "shape" a snake model takes and how combining the bitmask of the corn will affect perspective and framing of the final image. All of this information ends up inside the model's network.

Miika Aittala at Nvidia Research has done several presentations on taking a model (imagined as a wireframe) and then mapping a bitmapped image onto it with a convolutional neural network. They have shown generative abilities for making brick walls that look real, for example, from images of a bunch of brick walls and running those on various wireframes.

Maybe Imagen is an example of the next step in this, by using diffusion models instead of the CNN for the generator and adding in semantic text mappings while varying the language model's weights (i.e. allowing the language model to more broadly use related semantics when processing what is seen in a generated image). I'm probably wrong about half that.
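
The one piece I can sketch with some confidence is the "guidance weight" from the quoted passage. Assuming the usual classifier-free guidance formulation for diffusion models, the sampler asks the network for two noise predictions at each step, one with the text and one without, and the weight controls how far to push toward the text-conditioned one. The function name and the toy numbers below are mine, purely for illustration; a real model would return image-sized tensors, not two-element lists.

```python
def guided_noise(eps_uncond, eps_cond, w):
    """Blend two noise predictions using a guidance weight w.

    eps_uncond: prediction made without the text prompt (hypothetical stand-in)
    eps_cond:   prediction made with the text prompt (hypothetical stand-in)
    w = 0 ignores the text, w = 1 is the plain text-conditioned prediction,
    and w > 1 extrapolates past it, further toward the text.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for what a trained network would output at one sampling step.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

for w in (0.0, 1.0, 2.0):
    print(w, guided_noise(eps_uncond, eps_cond, w))
```

As I read the quote, turning the weight up is what trades image-text alignment (CLIP score) against fidelity (FID), which is why they plot one against the other as a curve over the weights.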

Here's my cut on how I saw this working from a few years ago: https://storage.googleapis.com/mitta-public/generate.PNG

Regardless of how it works, it's AMAZING that we are here now. Very exciting!