Hacker News
by sendtown_expwy 1932 days ago
The model was trained to match an image with its caption. So it's unsurprising to find a neuron that fires for both Spider-Man the character and a spider: both are instances of valid captions. Same with the "iPod" example. Seems a stretch to suggest equivalence to biological neurons (unless you think those are also trained by text supervision, which is an open hypothesis I suppose).

I think the surprising thing is that it's the same exact neuron that does the different modes, not that the network has the ability to detect a photo of spiderman and a picture with the text "spiderman". You might instead imagine that the network would have some neurons specialized to reading text, and others specialized to recognizing faces, and those would be like two separate paths through the network.
I think the OP meant that the network writes spider-man in both cases. So the multi-modal "Spider-man" neuron is more a "write spider-man" neuron. It is impossible to tell (at the moment) whether its formation and purpose are really comparable to how biological multi-modal "think about spider-man" neurons evolve and work. IMO it is not very probable.
Well, that's not really what's going on here. CLIP has two components - a text encoder that encodes candidate captions, and an image encoder. There's no part that does "writing" - it just makes an encoding for the image, and then sees which candidate text encoding is the most similar. Further, what's being looked at here, as I understand it, is JUST the image encoder part. The neuron in question isn't seeing or generating caption text, it's just a step along the way in trying to come up with a representation of the image. So it's surprising that the same neuron is strongly activated both by the word "spiderman" appearing in an image, and by an actual picture of spiderman.
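To make the "sees which candidate text encoding is the most similar" step concrete, here's a rough sketch of the matching. The embeddings are made-up three-dimensional vectors and `best_caption` is a hypothetical helper, not anything from the CLIP code; I'm assuming cosine similarity is the comparison. The real encoders output much larger vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_embedding, caption_embeddings):
    """Return the candidate caption whose embedding is closest to the image's."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, caption_embeddings[caption]
        ),
    )

# Made-up stand-ins for the image encoder and text encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of spiderman": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
}

print(best_caption(image_embedding, caption_embeddings))  # prints "a photo of spiderman"
```

Note that nothing here generates text: the candidate captions are supplied up front and the image side only produces a vector.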
If the text encoder's representations of "spiderman" and "spider" are close together, then the image encoder is penalized if it makes the image representations far apart from each other. So what we are witnessing here could simply be the neuron that says "I think 'spider' appears in the caption". Of course it could be that the network understands the metaphorical relationship of spiderman to spiders, but the simpler explanation seems more plausible to me.
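For what the "penalized" part might look like, here's a minimal sketch of a symmetric contrastive loss over a batch of image/text pairs, which is my understanding of the general shape of the CLIP objective. The two-dimensional embeddings, the `temperature` value, and the function names are all assumptions for illustration, not the actual training code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Image k is paired with text k; every other combination is a negative."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Each image should pick out its own text among all texts...
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # ...and each text should pick out its own image among all images.
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

# Made-up text embeddings and two candidate sets of image embeddings.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each image near its own caption
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # each image near the wrong caption

print(contrastive_loss(aligned, texts) < contrastive_loss(misaligned, texts))  # prints True
```

Under a loss like this, the image encoder is pushed to place an image wherever its caption's embedding is, so if two captions sit close together in text space, their images end up pulled close together too.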
Right, so I don't think the surprise is that the neuron responds to a picture of spiderman and a picture of a spider. What's surprising is that it responds to a picture of spiderman, and a picture of the text "spider". The thing that's interesting is not that it understands that the concept of a spider and the concept of spiderman are close together, it's that the same neuron is responsible for both the visual depiction of spiderman, and for an image of the text. In previous networks, you'd maybe have a neuron that detected a picture of text in general, but not one that would look for a particular word as well as the image it corresponds to.
If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But I see your point.
I meant "write" not in a literal sense. "CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset" Isn't this the implicit coupling between text and image that is observed as multi-modal neurons?
Well, the text encoder sees the ASCII characters s-p-i-d-e-r (after byte-pair encoding). That's different from seeing a photograph of a piece of paper that says "spider" on it. What's surprising is not that the network can associate a picture of spiderman with a caption that contains the text "spider", but rather that the same neuron lights up when you show it a piece of paper that says "spider" as when you show it a picture of spiderman.
Maybe I don't get something about CLIP. But won't there be the same labels, and as a result the same pairings, for a piece of paper with "spider" written on it and a picture of Spiderman?
The basic idea of having concept aggregation to match incoming and outgoing information in various modes is applicable to both.