by ollin 842 days ago
They use three text encoders to encode the caption:

1. CLIP-G/14 (OpenCLIP)

2. CLIP-L/14 (OpenAI)

3. T5-v1.1-XXL (Google)

They randomly disable encoders during training, so that when generating images SD3 can use any subset of the three encoders. They find that T5-XXL matters only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
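A rough sketch of what that per-encoder dropout could look like, in plain Python. This is not SD3's actual code: the encoder labels, the `drop_encoders` helper, the `drop_prob` value, and the choice to represent a disabled encoder by zeroing its embedding are all assumptions made for illustration.

```python
import random

# Hypothetical labels for the three text encoders listed above.
ENCODERS = ("clip_g", "clip_l", "t5_xxl")


def drop_encoders(embeddings, drop_prob, rng):
    """Independently disable each encoder with probability drop_prob.

    Assumption: a disabled encoder is represented by an all-zero
    embedding of the same length, so downstream shapes do not change.
    """
    out = {}
    for name in ENCODERS:
        vec = embeddings[name]
        if rng.random() < drop_prob:
            out[name] = [0.0] * len(vec)  # encoder dropped for this sample
        else:
            out[name] = list(vec)  # encoder kept unchanged
    return out


# Toy embeddings standing in for real encoder outputs.
sample = {
    "clip_g": [0.1, 0.2],
    "clip_l": [0.3, 0.4],
    "t5_xxl": [0.5, 0.6, 0.7],
}

rng = random.Random(0)
# drop_prob here is an arbitrary illustrative value, not the paper's.
train_inputs = drop_encoders(sample, drop_prob=0.5, rng=rng)
```

Because the model sees every combination of present and missing encoders during training, at generation time you can pass zeros for, say, the T5 slot and skip loading that encoder entirely.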