

	by chromanoid 1936 days ago
	Maybe it just recognizes pixel soups? Why should it know the difference between a piece of text and a real spider? It's just our interpretation of the image that makes it "multi-modal". CLIP probably just categorizes certain kinds of white and black patterns as a special kind of spider that happens to also look like a piece of paper and instances of text.


Imnimo 1936 days ago

I mean, yeah, it does just recognize pixel soups - all the neurons are just semi-scrutable combinations of other features. It's probably the case that there are some early neurons that recognize various letters, and so you'd have some subset of neurons that are shared between the "spiderman" neuron circuit and the circuits that are used by other neurons that recognize other words. I don't know how much credit you'd give that for "reading", but I'd say it at least would qualify as multi-modal.


chromanoid 1936 days ago

So if the network could recognize species by their rear view, would you also consider it multi-modal? Because that is what I am trying to say... there are probably no higher-level concepts here, just various different pixel soups that happen to need the same label.


mrow84 1935 days ago

If the trained network produces common labels for sets of pixel soups that we consider to be semantically related, but are not visually related, then that is interesting.


chromanoid 1935 days ago

I agree. But isn't it more probable that some kind of arbitrary "OR" logic forms, rather than the "real abstraction" that the word "multi-modal" implies?

I guess we see something like this:

  e.g. "Photo of spider" -> Hierarchy of pixel soups -> "Photo of spider" OR "Photo of word spider" OR "Spider rear view" OR "Spiderman" OR ... -> [Spider]

What I think the authors want to tell me when calling it multi-modal:

  "Photo of spider" -> "Characteristics of a spider" -> [Spider]
  "Photo of word spider" -> "Letters S-P-I-D-E-R" -> "Written word spider" -> [Spider]
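
The two readings above can be sketched as toy Python. This is only an illustration under loud assumptions: an "image" is a set of made-up feature names rather than pixels, every function and feature name is hypothetical, and nothing here reflects how CLIP actually works internally.

```python
# Toy illustration only: an "image" is a set of hypothetical feature names,
# not pixels, and nothing here reflects CLIP's actual internals.

# Reading 1: a flat OR over unrelated patterns that happen to share a label.
SPIDER_PATTERNS = [
    {"eight_legs", "round_body"},      # photo of a spider
    {"letters:S-P-I-D-E-R"},           # photo of the word "spider"
    {"red_blue_suit", "web_pattern"},  # Spiderman
]

def flat_or_label(features):
    """Return True if any listed pattern is fully present in the input."""
    return any(pattern <= features for pattern in SPIDER_PATTERNS)

# Reading 2: each kind of input is first mapped to an intermediate concept,
# and the label is read off that concept.
def visual_concept(features):
    """Map visual characteristics to a concept name, or None."""
    return "spider" if {"eight_legs", "round_body"} <= features else None

def written_concept(features):
    """Spell out the written word from a 'letters:' feature, or None."""
    for f in features:
        if f.startswith("letters:"):
            return f.split(":", 1)[1].replace("-", "").lower()
    return None

def abstraction_label(features):
    """Return True if either intermediate concept is 'spider'."""
    return "spider" in (visual_concept(features), written_concept(features))

photo = {"eight_legs", "round_body", "green_leaf"}
word = {"letters:S-P-I-D-E-R", "white_paper"}
cat = {"four_legs", "whiskers"}

for name, image in [("photo", photo), ("word", word), ("cat", cat)]:
    print(name, flat_or_label(image), abstraction_label(image))
```

On these three inputs both functions return the same labels, so the outputs alone do not tell the two readings apart.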


mrow84 1935 days ago

I guess there are two ways of looking at that question.

One is just basic generalisation - do these neurons effectively capture things that fall within their semantic group (whatever that means) but lie completely outside the training data? If yes, then I guess the answer might be 'yes, in some sense it is like a "real abstraction"'.

Second, and (afaik) currently a more philosophical framing - it isn't obvious whether "(sufficiently advanced) OR logic" and "real abstraction" are actually different. Additionally, for the purposes of a model like this one, I find it hard to see how they could be different. The best the model can do is (roughly speaking) assign neurons to particular concepts, be they ones that fit with our mental models of the world, or ones that are more "functional". The better a job it does of the former, the more we might be inclined to believe that it is modelling things as we understand them.


chromanoid 1935 days ago

A long time ago I implemented a Neocognitron. As far as I understand, most deep learning models are very similar in architecture. My Neocognitron could recognize 1 and 0 (grey scale). I handcrafted the weights and hierarchy (as you do with a normal beginner Neocognitron). When you wrote a 1 into the hole of the 0, or multiple 0s next to a 1, you could see how it sometimes weighted the 1 higher and sometimes the 0. It's not very far-fetched to add an "OR" layer that can combine patterns of 1 and 0. Yet I wouldn't call this layer multi-modal. I guess that in CLIP's case, due to the training method and the output alignment, this kind of OR layer simply emerges in some cases.
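
A minimal sketch of the kind of OR layer meant here, assuming 3x3 binary images and two handcrafted templates. This is not a real Neocognitron (no S-cells or C-cells, no hierarchy); the templates, the `match` score and the 0.8 threshold are all hypothetical choices made for illustration.

```python
# Toy sketch, not a real Neocognitron: 3x3 binary images, flattened row by
# row, and two handcrafted templates chosen purely for illustration.
ONE = [0, 1, 0,
       0, 1, 0,
       0, 1, 0]
ZERO = [1, 1, 1,
        1, 0, 1,
        1, 1, 1]

def match(image, template):
    """Fraction of pixels on which the image and the template agree."""
    return sum(p == t for p, t in zip(image, template)) / len(template)

def or_unit(image, threshold=0.8):
    """Fire if either digit detector responds strongly (max acts as OR)."""
    return max(match(image, ONE), match(image, ZERO)) >= threshold

# A 1 written into the hole of the 0: every pixel is set.
one_inside_zero = [1] * 9
blank = [0] * 9

print(match(one_inside_zero, ONE), match(one_inside_zero, ZERO))
print(or_unit(ONE), or_unit(ZERO), or_unit(one_inside_zero), or_unit(blank))
```

The OR unit fires for a 1, a 0, and the 1-inside-0 image, but it only takes the maximum of two pattern scores; it holds no shared concept of "digit" beyond that.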

When I talked about OR logic I meant a simple and direct aggregation of all possible pixel soups that should be labeled in a specific way. Calling this multimodal and comparing it to the Jennifer Aniston neuron is framing the situation in a way that is not good for science (see AI winter). Especially when multimodal neurons in neuroscience refer to processing more than one sense.

I find it hard to imagine a different solution than sufficiently advanced OR logic for real abstraction within the current ANN models. That does not mean there is none within ANNs and especially not in the brain. We are far from being able to do anything like the brain does and don't know how it learns new concepts so fast - without thousands of sample images. Just because we observe an effect that reminds us of some vague findings in the brain and we call the effect the same in ANNs does not make ANNs more like brains.

BTW nice podcast that also discusses the Jennifer Aniston neuron with its "discoverer": https://soundcloud.com/neuropodcast/episode62
