	by mkaic 1145 days ago
	I'm quite curious how much of the improvement in text rendering comes from the switch to pixel-space diffusion vs. the switch to a much larger pretrained text encoder. I'm leaning towards the latter, which raises the question of what happens if you train Stable Diffusion with T5 v1.1 XXL as the text encoder instead of CLIP: does it gain the ability to render text well?

4 comments

minimaxir 1145 days ago

DeepFloyd IF is effectively the same architecture/text encoder as Imagen (https://imagen.research.google/), although that paper doesn't hypothesize why text works out a lot better.


mkaic 1145 days ago

Right, I'm aware of the Imagen architecture, just curious to see further research determining which aspect of it is responsible for the improved text rendering.

EDIT: According to the figure in the Imagen paper FL33TW00D's response referred me to, it looks like the text encoder size is the biggest factor in the improved model performance all-around.


GaggiX 1145 days ago

The CLIP text encoder is trained so that its pooled text embedding (a single vector) aligns with the pooled image embedding, which is why most of the per-token text embeddings are not very meaningful on their own (though they still convey the overall semantics of the text). With T5, every per-token embedding is important.
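
To make the pooled-vs-per-token point concrete, here is a toy sketch in plain Python. It is not CLIP's actual code: the vectors are made-up numbers, `clip_style_score` is a hypothetical helper, and it assumes the pooled vector is simply the output state at the final (EOS) position. It only shows that a similarity computed from the pooled vector cannot see the other output states, so a contrastive objective built on it puts no direct pressure on them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def clip_style_score(token_states, image_embedding):
    """Hypothetical helper: the similarity a CLIP-style contrastive loss
    would see. Assumption: only the last (EOS) output state is pooled."""
    pooled = token_states[-1]
    return cosine(pooled, image_embedding)


# Made-up per-token output states for a 4-token prompt (3 dims each).
token_states = [
    [0.2, 0.1, 0.0],
    [0.5, -0.3, 0.8],
    [-0.1, 0.4, 0.2],
    [0.9, 0.1, 0.3],  # EOS position, used as the pooled vector
]
image_embedding = [1.0, 0.0, 0.5]  # made-up pooled image vector

# Overwrite every non-EOS output state with junk values.
scrambled = [[9.0, -9.0, 9.0] for _ in token_states[:-1]] + [token_states[-1]]

score_original = clip_style_score(token_states, image_embedding)
score_scrambled = clip_style_score(scrambled, image_embedding)

# The pooled-vector objective cannot tell the two sequences apart.
unchanged = score_original == score_scrambled

# A model conditioned on the full sequence (e.g. via cross-attention)
# would receive different inputs in the two cases.
sequence_differs = token_states != scrambled

print(unchanged, sequence_differs)
```

In a real transformer the other positions still influence the EOS state through attention in earlier layers, so this only illustrates the lack of direct supervision on the per-token outputs, not that they carry no information.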


FL33TW00D 1145 days ago

Check out figure 4A from the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf

From that, I would strongly suspect the answer to your question to be yes.


mkaic 1145 days ago

Ah, yes, this seems to be pretty strong evidence. Thanks for pointing that figure out to me!


radq 1145 days ago

It is most likely due to the text encoder; see "Character-Aware Models Improve Visual Text Rendering": https://arxiv.org/abs/2212.10562
