by Imnimo 1934 days ago
The labels are just whatever people on the internet wrote next to the image. Certainly there are some instances of things like "this is a picture that says 'spider'" or whatever (probably a little more natural than that), or else the network would have no way of learning to read. But what's interesting here is that it's the same neuron doing the reading and doing the recognizing of Spiderman's head. That's not the only way that it could have solved the problem. There could have been some dimensions of the representation vector used for reading text, and others for recognizing visual objects, and those would be handled by separate subsets of neurons in the network.
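
To make the distinction concrete, here's a toy sketch of the two options. This is not how CLIP is implemented: the weights, the evidence values, and the function names are all made up for illustration. It only shows what "one neuron pooling both kinds of evidence" versus "separate neurons per kind of evidence" would look like.

```python
from typing import Tuple

# Toy sketch, not CLIP: all activations and weights are made-up numbers.

def relu(x: float) -> float:
    return max(0.0, x)

# Hypothetical upstream evidence per input: (text_evidence, visual_evidence)
INPUTS = {
    "image of the written word 'spider'": (1.0, 0.0),
    "image of Spiderman's head": (0.0, 1.0),
    "image of a teapot": (0.0, 0.0),
}

def shared_neuron(text: float, visual: float) -> float:
    """One unit that pools both kinds of evidence (the observed case)."""
    return relu(0.9 * text + 0.9 * visual - 0.1)

def separate_neurons(text: float, visual: float) -> Tuple[float, float]:
    """The alternative: one unit per kind of evidence, in disjoint dimensions."""
    return relu(0.9 * text - 0.1), relu(0.9 * visual - 0.1)

for name, (t, v) in INPUTS.items():
    print(name, shared_neuron(t, v), separate_neurons(t, v))
```

In the shared version the same unit fires for the written word and for the head; in the separate version each unit fires for only one of them, and something downstream would have to combine them.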

Maybe it just recognizes pixel soups? Why should it know the difference between a piece of text and a real spider? It's just our interpretation of the image that makes it "multi-modal". CLIP probably just categorizes certain kinds of white and black patterns as a special kind of spider that happens to also look like a piece of paper and instances of text.
I mean, yeah, it does just recognize pixel soups - all the neurons are just semi-scrutable combinations of other features. It's probably the case that there are some early neurons that recognize various letters, and so you'd have some subset of neurons that are shared between the "spiderman" neuron circuit and the circuits that are used by other neurons that recognize other words. I don't know how much credit you'd give that for "reading", but I'd say it at least would qualify as multi-modal.
So if the network can recognize species by their rear view, would you also consider it multi-modal? That is what I am trying to say: there are probably no higher-level concepts here, just various different pixel soups that happen to need the same label.
If the trained network produces common labels for sets of pixel soups that we consider to be semantically related, but are not visually related, then that is interesting.
I agree. But isn't it more probable that some kind of arbitrary "OR" logic forms, rather than the "real abstraction" implied by choosing the word "multi-modal"?

I guess we see something like this:

  e.g. "Photo of spider" -> Hierarchy of pixel soups -> "Photo of spider" OR "Photo of word spider" OR "Spider rear view" OR "Spiderman" OR ... -> [Spider]
What I think the authors want to tell me when calling it multi-modal:

  "Photo of spider" -> "Characteristics of a spider" -> [Spider]
  "Photo of word spider" -> "Letters S-P-I-D-E-R" -> "Written word spider" -> [Spider]
I guess there are two ways of looking at that question.

One is just basic generalisation - do these neurons effectively capture things within their semantic group (whatever that means) but completely outside of the training data. If yes, then I guess the answer might be 'yes, in some sense it is like a "real abstraction"'.

Second, and (afaik) currently a more philosophical framing - it isn't obvious whether "(sufficiently advanced) OR logic" and "real abstraction" are actually different. Additionally, for the purposes of a model like this one, I find it hard to see how they could be different. The best the model can do is (roughly speaking) assign neurons to particular concepts, be they ones that fit with our mental models of the world, or ones that are more "functional". The better a job it can do of the former, the more we might be inclined to believe that it is modelling things as we understand them.
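
A toy way to see why the two are hard to tell apart, using the two pictures sketched upthread. The inputs here are hand-written descriptions rather than pixels, and every function name is hypothetical. It's a sketch of the argument, not of the actual model.

```python
from typing import Optional

# Toy illustration: inputs are descriptions, not pixels; all names are made up.

def flat_or_model(image: str) -> bool:
    """First picture: one flat OR over unrelated pixel soups."""
    return image in {
        "photo of spider",
        "photo of word spider",
        "spider rear view",
        "spiderman",
    }

def read_word(image: str) -> Optional[str]:
    """Reading route: letters -> written word (None if there is no text)."""
    prefix = "photo of word "
    if image.startswith(prefix):
        letters = list(image[len(prefix):])  # e.g. ['s', 'p', 'i', 'd', 'e', 'r']
        return "".join(letters)
    return None

def looks_like_spider(image: str) -> bool:
    """Visual route: stands in for 'characteristics of a spider'."""
    return image == "photo of spider"

def staged_model(image: str) -> bool:
    """Second picture: separate routes that meet at one concept."""
    return looks_like_spider(image) or read_word(image) == "spider"

for image in ["photo of spider", "photo of word spider", "photo of teapot"]:
    print(image, flat_or_model(image), staged_model(image))
```

On these inputs the two give identical labels, so the labels alone can't distinguish them. Any difference would have to show up in the intermediate activations, or on inputs that neither version was tuned on.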