by chromanoid
1936 days ago
I didn't mean "write" in a literal sense. "CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset." Isn't this exactly the implicit coupling between text and image that is observed as multi-modal neurons?
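
To make the "predict which images were paired with which texts" part concrete, here is a rough sketch of a symmetric contrastive objective of that kind. It is only an illustration under my own assumptions, not CLIP's actual code: the function names, the toy 2-d embeddings, and the fixed temperature of 0.07 are all hypothetical, and the real encoders are replaced by hand-written vectors.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: image_embs[k] and text_embs[k] are assumed to be a true pair.
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Each image should pick out its own text (rows) ...
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each text should pick out its own image (columns).
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_texts = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (loss_images + loss_texts) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, matched_texts) < contrastive_loss(images, swapped_texts))
```

The loss is small only when each image embedding sits closest to its own text embedding, so both encoders are pushed toward a shared space. That shared space is the coupling I had in mind.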