by GaggiX
1143 days ago
The CLIP text encoder is trained so that its pooled output (a single vector) aligns with the pooled image embedding, which is why most of the individual token embeddings are not very meaningful on their own (though they still convey the overall semantics of the text). With T5, every token embedding is important.
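To make the distinction concrete, here is a rough toy sketch (not actual CLIP or T5 code). It assumes, as in the reference CLIP implementation, that the pooled vector is the hidden state at the end-of-text token, which is located as the position with the highest token id. The function names, token ids, and random "hidden states" are all made up for illustration.

```python
import random

def clip_style_pool(token_states, token_ids):
    """Pick the single hidden state at the end-of-text position.

    Assumption: the end-of-text token has the highest id in the sequence,
    so its position can be found with an argmax over the ids.
    """
    eot_pos = max(range(len(token_ids)), key=lambda i: token_ids[i])
    return token_states[eot_pos]

def t5_style_states(token_states, attention_mask):
    """Keep every non-padding hidden state (one vector per real token)."""
    return [s for s, m in zip(token_states, attention_mask) if m == 1]

random.seed(0)
dim = 8
# Hypothetical toy ids: 49407 plays the end-of-text role, 0 is padding.
token_ids = [101, 7, 42, 49407, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0]
# Random stand-ins for the encoder's per-token hidden states.
token_states = [[random.gauss(0, 1) for _ in range(dim)] for _ in token_ids]

pooled = clip_style_pool(token_states, token_ids)             # one vector
per_token = t5_style_states(token_states, attention_mask)     # list of vectors

print(len(pooled))     # length of the single pooled vector
print(len(per_token))  # number of token vectors kept
```

The contrastive loss only ever sees something like `pooled`, so nothing forces the other positions to be individually useful. A model conditioned on T5 would instead consume something like `per_token` in full.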