Hacker News
by Imnimo 1935 days ago
Right, so I don't think the surprise is that the neuron responds to a picture of Spider-Man and a picture of a spider. What's surprising is that it responds to a picture of Spider-Man and to a picture of the text "spider". The interesting part is not that it understands that the concept of a spider and the concept of Spider-Man are close together; it's that the same neuron is responsible for both the visual depiction of Spider-Man and an image of the text. In previous networks, you'd maybe have a neuron that detected pictures of text in general, but not one that looks for a particular word as well as the image it corresponds to.

If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But I see your point.
Yeah, I mean it's definitely the case that the signal here is that the same words occur in captions for pictures of text as for pictures of objects. But if you do this same experiment of visualizing dataset examples for other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
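For anyone who wants to try the dataset-example experiment on another network, here is a rough sketch of the core step: rank the dataset by how strongly each example excites one neuron and look at the top few. It assumes you have already recorded per-example activations for some layer. The function name, the toy numbers, and the labels are all made up for illustration and are not taken from the CLIP work.

```python
import heapq

def top_activating_examples(activations, neuron, k=3):
    """Return indices of the k examples that most strongly excite one neuron.

    activations: list of per-example activation vectors (hypothetical data,
    e.g. recorded from a single layer of a vision model).
    """
    scored = ((acts[neuron], idx) for idx, acts in enumerate(activations))
    return [idx for _, idx in heapq.nlargest(k, scored)]

# Toy, made-up activations: 5 examples x 2 neurons.
toy_activations = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.0, 0.3],
    [0.9, 0.4],
]
# Hypothetical descriptions of what each example shows.
toy_labels = [
    "photo: dog",
    "photo: Spider-Man",
    "text: 'spider'",
    "photo: car",
    "photo: spider",
]

top = top_activating_examples(toy_activations, neuron=0, k=3)
print([toy_labels[i] for i in top])
```

If the top examples for a single neuron mix photos and rendered text of the same word, that's the multimodal pattern being discussed; if they are all one kind of image, it isn't.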
I am not sure I get it, but isn't the CLIP "multi-modal" neuron only considered "multi-modal" because it occurs some layers before the actual output? Maybe the indirection is just obfuscation and not a sign of abstraction.