| HN Mirror



	by Imnimo 1940 days ago
	Well, that's not really what's going on here. CLIP has two components: a text encoder that encodes candidate captions, and an image encoder. There's no part that does "writing". It just makes an encoding for the image, and then sees which candidate text encoding is the most similar. Further, what's being looked at here, as I understand it, is JUST the image encoder part. The neuron in question isn't seeing or generating caption text, it's just a step along the way in trying to come up with a representation of the image. So it's surprising that the same neuron is strongly activated both by the word "spiderman" appearing in an image, and by an actual picture of spiderman.
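A minimal sketch of that matching step, assuming toy 3-dimensional embeddings in place of real encoder outputs. The vectors, the caption strings, and the `best_caption` helper are hypothetical illustrations, not CLIP's actual API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is most similar to the image's."""
    return max(caption_vecs, key=lambda c: cosine(image_vec, caption_vecs[c]))

# Toy embeddings, made up for illustration; real ones come from the two encoders.
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a photo of spiderman": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
}
print(best_caption(image_vec, caption_vecs))
```

Nothing here generates text: the candidate captions are supplied up front and the image embedding only ranks them.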

2 comments

sendtown_expwy 1940 days ago

If the text encoder's representations of "spiderman" and "spider" are close together, then the image encoder is penalized if it makes the image representations far apart from each other. So what we are witnessing here could simply be the neuron that says "I think 'spider' appears in the caption". Of course it could be that the network understands the metaphorical relationship of spiderman to spiders, but the simpler explanation seems more plausible to me.
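A rough sketch of the kind of penalty being described, assuming a softmax cross-entropy over image-to-caption similarities within a batch. This covers only the image-to-text direction; the fixed `temperature=0.1`, the 2-dimensional vectors, and the `image_to_text_loss` helper are assumptions for illustration, not values or names from CLIP.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def image_to_text_loss(image_vec, text_vecs, target, temperature=0.1):
    """Cross-entropy of picking the paired caption out of the batch."""
    logits = [dot(image_vec, t) / temperature for t in text_vecs]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

# Toy unit-length caption embeddings; index 0 is the paired caption.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
near = [1.0, 0.0]  # image embedding aligned with its own caption
far = [0.0, 1.0]   # image embedding aligned with the wrong caption

print(image_to_text_loss(near, text_vecs, target=0))
print(image_to_text_loss(far, text_vecs, target=0))
```

The loss is small when the image embedding sits next to its paired caption and large otherwise, so two images whose captions embed close together get pulled toward the same region.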


Imnimo 1940 days ago

Right, so I don't think the surprise is that the neuron responds to a picture of spiderman and a picture of a spider. What's surprising is that it responds to a picture of spiderman, and a picture of the text "spider". The thing that's interesting is not that it understands that the concept of a spider and the concept of spiderman are close together, it's that the same neuron is responsible for both the visual depiction of spiderman, and for an image of the text. In previous networks, you'd maybe have a neuron that detected a picture of text in general, but not one that would look for a particular word as well as the image it corresponds to.


sendtown_expwy 1940 days ago

If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But, I see your point.


Imnimo 1940 days ago

Yeah, I mean it's definitely the case that the signal here is that the same words occur in captions for pictures of text as for pictures of objects. But if you do this same experiment of visualizing dataset examples for other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
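For reference, "visualizing dataset examples" roughly means ranking images by how strongly they activate one neuron and looking at the top of the list. A minimal sketch, assuming the activations have already been recorded; the image descriptions, the numbers, and the `top_activating` helper are all hypothetical.

```python
import heapq

def top_activating(activations, k=3):
    """Return the k dataset examples with the highest activation for one neuron."""
    return heapq.nlargest(k, activations, key=activations.get)

# Hypothetical activations of a single neuron on a handful of images.
activations = {
    "photo of spiderman": 4.1,
    "photo of the word 'spider'": 3.8,
    "photo of a dog": 0.2,
    "photo of a tarantula": 3.5,
    "photo of a car": 0.1,
}
print(top_activating(activations))
```

A neuron would look multimodal in this sense if the top of that list mixed photos, drawings, and rendered text of the same concept.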


chromanoid 1939 days ago

I am not sure if I get it, but isn't the CLIP "multi-modal" neuron just considered as "multi-modal" because it occurs some layers before the actual output? I am not sure, but maybe the indirection is just obfuscation and not a sign of abstraction.


chromanoid 1940 days ago

I meant "write" not in a literal sense. "CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset" Isn't this the implicit coupling between text and image that is observed as multi-modal neurons?


Imnimo 1940 days ago

Well, the text encoder sees the ASCII characters s-p-i-d-e-r (after byte-pair encoding). That's different from seeing a photograph of a piece of paper that says "spider" on it. What's surprising is not that the network can associate a picture of spiderman with a caption that contains the text "spider", but rather that the same neuron lights up when you show it a piece of paper that says "spider" as when you show it a picture of spiderman.


chromanoid 1940 days ago

Maybe I don't get something about CLIP. But won't there be the same labels, and as a result the same pairings, for a piece of paper with "spider" written on it and a picture of Spiderman?


Imnimo 1940 days ago

The labels are just whatever people on the internet wrote next to the image. Certainly there are some instances of things like "this is a picture that says 'spider'" or whatever (probably a little more natural than that), or else the network would have no way of learning to read. But what's interesting here is that it's the same neuron doing the reading and doing the recognizing of Spiderman's head. That's not the only way that it could have solved the problem. There could have been some dimensions of the representation vector used for reading text, and others for recognizing visual objects, and those would be handled by separate subsets of neurons in the network.


chromanoid 1940 days ago

Maybe it just recognizes pixel soups? Why should it know the difference between a piece of text and a real spider? It's just our interpretation of the image that makes it "multi-modal". CLIP probably just categorizes certain kinds of white and black patterns as a special kind of spider that happens to also look like a piece of paper and instances of text.


Imnimo 1940 days ago

I mean, yeah, it does just recognize pixel soups - all the neurons are just semi-scrutable combinations of other features. It's probably the case that there are some early neurons that recognize various letters, and so you'd have some subset of neurons that are shared between the "spiderman" neuron circuit and the circuits that are used by other neurons that recognize other words. I don't know how much credit you'd give that for "reading", but I'd say it at least would qualify as multi-modal.
