by ollin
842 days ago
They use three text encoders to encode the caption:

1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)

They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
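
To make the "randomly disable encoders" idea concrete, here is a rough sketch of what per-example encoder dropout could look like. This is not SD3's actual code: the names `sample_active_encoders` and `mask_embeddings`, the 0.5 drop probability, and the choice to zero out a disabled encoder's output are all assumptions for illustration.

```python
import random

# Hypothetical labels for the three text encoders named above.
ENCODERS = ("clip_g", "clip_l", "t5_xxl")


def sample_active_encoders(drop_prob=0.5, rng=random):
    """Independently keep or drop each encoder for one training example.

    drop_prob is an illustrative value, not a figure taken from the comment.
    """
    return {name: rng.random() >= drop_prob for name in ENCODERS}


def mask_embeddings(embeddings, active):
    """Zero the embedding of every disabled encoder (assumed masking scheme)."""
    return {
        name: vec if active[name] else [0.0] * len(vec)
        for name, vec in embeddings.items()
    }


# Toy embeddings standing in for real encoder outputs.
toy = {"clip_g": [0.1, 0.2], "clip_l": [0.3, 0.4], "t5_xxl": [0.5, 0.6, 0.7]}
active = sample_active_encoders()
masked = mask_embeddings(toy, active)
```

Because the model sees every combination of present and missing encoders during training, at inference you could pass a fixed mask instead of a sampled one, for example `{"clip_g": True, "clip_l": True, "t5_xxl": False}` to skip T5 XXL on short prompts.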