by edshiro
829 days ago
This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible. There was one thing I was curious about; I skimmed the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to upgrade this part of the model's architecture to improve adherence to text and image prompts.
|
1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)
They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
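To make the "randomly disable encoders" idea concrete, here is a rough sketch of how it could work, not the actual SD3 training code. The names `drop_encoders` and `select_encoders`, the `p_drop` value, and the choice to represent a disabled encoder by zeroing its embedding are all assumptions for illustration. The short lists stand in for real embedding tensors.

```python
import random

# Labels for the three text encoders listed above (names are illustrative).
ENCODERS = ("clip_g", "clip_l", "t5_xxl")


def drop_encoders(embeddings, p_drop=0.5, rng=random):
    """Training-time sketch: independently zero each encoder's embedding
    with probability p_drop (a hypothetical rate, not taken from the paper)."""
    out = {}
    for name in ENCODERS:
        vec = embeddings[name]
        if rng.random() < p_drop:
            out[name] = [0.0] * len(vec)  # encoder "disabled" for this sample
        else:
            out[name] = list(vec)
    return out


def select_encoders(embeddings, keep):
    """Inference-time counterpart: keep a chosen subset, zero out the rest."""
    return {
        name: list(vec) if name in keep else [0.0] * len(vec)
        for name, vec in embeddings.items()
    }


if __name__ == "__main__":
    emb = {"clip_g": [1.0, 2.0], "clip_l": [3.0], "t5_xxl": [4.0, 5.0, 6.0]}
    print(drop_encoders(emb))                        # random subset survives
    print(select_encoders(emb, {"clip_g", "clip_l"}))  # run without T5
```

Because the model has seen every combination of missing encoders during training, skipping T5 at inference (as in the last call) should look like a familiar input rather than an out-of-distribution one, which is presumably what lets SD3 run with any subset.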