by sendtown_expwy 1938 days ago
If the text encoder's representations of "spiderman" and "spider" are close together, then the image encoder is penalized if it places the corresponding image representations far apart. So what we are witnessing here could simply be the neuron that says "I think 'spider' appears in the caption". Of course, it could be that the network understands the metaphorical relationship of spiderman to spiders, but the simpler explanation seems more plausible to me.
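To make the penalty argument concrete, here is a rough sketch, assuming a CLIP-style symmetric contrastive loss over cosine similarities. Everything in it is made up for illustration: the 2-D toy embeddings, the temperature of 0.1, and the helper names `unit`, `cross_entropy`, and `contrastive_loss`. It is not CLIP's actual training code.

```python
import math

def unit(angle):
    """2-D unit vector at the given angle (radians); a toy stand-in for a normalized embedding."""
    return (math.cos(angle), math.sin(angle))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed with the max-shift trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss: image k should match text k, and vice versa."""
    n = len(image_embs)
    # logits[r][c] = scaled similarity between image r and text c
    logits = [[dot(i, t) / temperature for t in text_embs] for i in image_embs]
    img_to_txt = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    txt_to_img = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2

# Hypothetical caption embeddings for "spiderman" and "spider": close together.
texts = [unit(0.0), unit(0.2)]

# Image embeddings that sit on their captions (and so near each other) ...
images_near = [unit(0.0), unit(0.2)]
# ... versus a second image placed far from the first, and so far from its caption.
images_far = [unit(0.0), unit(2.5)]

loss_near = contrastive_loss(images_near, texts)
loss_far = contrastive_loss(images_far, texts)
print(loss_near, loss_far)
```

In this toy setup the loss is higher for `images_far` than for `images_near`: each image is pulled toward its own caption, so captions that sit close together drag their images close together too. The same loss also pushes in-batch pairs slightly apart, so this only illustrates the pull, not the full training dynamics.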

Right, so I don't think the surprise is that the neuron responds to a picture of spiderman and a picture of a spider. What's surprising is that it responds to a picture of spiderman and to a picture of the text "spider". The interesting thing is not that it understands that the concept of a spider and the concept of spiderman are close together; it's that the same neuron is responsible for both the visual depiction of spiderman and an image of the text. In previous networks, you'd maybe have a neuron that detected a picture of text in general, but not one that would look for a particular word as well as the image it corresponds to.
If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But I see your point.
Yeah, I mean it's definitely the case that the signal here is that words occur in captions for pictures of text as well as for pictures of objects. But if you do this same experiment of visualizing dataset examples for other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
I am not sure if I get it, but isn't the CLIP "multi-modal" neuron only considered "multi-modal" because it occurs some layers before the actual output? Maybe the indirection is just obfuscation and not a sign of abstraction.