by edshiro 829 days ago
This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.

There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to improve on this part of the model's architecture to improve adherence to text and image prompts.

2 comments

They use three text encoders to encode the caption:

1. CLIP-G/14 (OpenCLIP)

2. CLIP-L/14 (OpenAI)

3. T5-v1.1-XXL (Google)

They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
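
To make the "randomly disable encoders" idea concrete, here is a rough sketch of how it could look. This is not Stability AI's code: the function name `drop_encoders`, the `drop_prob` value, the encoder labels, and the choice to replace a disabled encoder's output with zeros are all assumptions made for illustration.

```python
import random

# Labels for the three encoders listed above (hypothetical names).
ENCODERS = ("clip_g", "clip_l", "t5_xxl")


def drop_encoders(embeddings, drop_prob=0.5, rng=random):
    """Independently disable each encoder with probability drop_prob.

    `embeddings` maps an encoder label to its embedding (a list of floats).
    A disabled encoder's embedding is replaced by zeros of the same length
    (an assumption of this sketch), so the downstream model learns to cope
    with any subset of encoders being absent.
    """
    out = {}
    for name in ENCODERS:
        vec = embeddings[name]
        if rng.random() < drop_prob:
            out[name] = [0.0] * len(vec)  # encoder disabled for this sample
        else:
            out[name] = list(vec)  # encoder kept as-is
    return out


# Toy embeddings, far smaller than the real ones.
example = {
    "clip_g": [0.1, 0.2],
    "clip_l": [0.3, 0.4],
    "t5_xxl": [0.5, 0.6, 0.7],
}
training_input = drop_encoders(example, drop_prob=0.5)
```

At generation time the same mechanism would let you skip an encoder on purpose, for example by zeroing the `t5_xxl` entry when the prompt is short and memory is tight.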

One of the diagrams says they're using CLIP-G/14 and CLIP-L/14, which are the names of two OpenCLIP models - meaning they're not using OpenAI's CLIP.
I have just been informed that my comment above is false: CLIP-L does in fact refer to OpenAI's model, even though an OpenCLIP model shares that name.