by Imnimo
1940 days ago
Well, that's not really what's going on here. CLIP has two components: a text encoder that encodes candidate captions, and an image encoder. There's no part that does "writing" - it just makes an encoding of the image, and then sees which candidate text encoding is the most similar. Further, as I understand it, what's being looked at here is JUST the image encoder. The neuron in question isn't seeing or generating caption text; it's just a step along the way to a representation of the image. So it's surprising that the same neuron is strongly activated both by the word "spiderman" appearing in an image and by an actual picture of spiderman.
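
To make the "encode both, then compare" step concrete, here's a rough sketch of the matching. It is not CLIP's actual code: the embeddings are made-up three-dimensional vectors standing in for the outputs of the two encoders, the names `cosine_similarity` and `pick_caption` are hypothetical, and plain cosine similarity is assumed as the comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_caption(image_embedding, caption_embeddings):
    """Score every candidate caption against the image; return the best one.

    Nothing is generated here: the captions are supplied up front, and the
    only work is comparing their embeddings to the image embedding.
    """
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up embeddings (assumption): stand-ins for what an image encoder and a
# text encoder would produce. Real embeddings have hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of spiderman": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

best, scores = pick_caption(image_embedding, caption_embeddings)
print(best)
```

The neuron under discussion would sit inside whatever produces `image_embedding`, upstream of any comparison with text.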