by mrow84 1934 days ago
If the trained network produces common labels for sets of pixel soups that we consider to be semantically related, but are not visually related, then that is interesting.

I agree. But isn't it more probable that some kind of arbitrary "OR" logic forms, rather than the "real abstraction" implied by the choice of the word "multi-modal"?

I guess we see something like this:

  e.g. "Photo of spider" -> Hierarchy of pixel soups -> "Photo of spider" OR "Photo of word spider" OR "Spider rear view" OR "Spiderman" OR ... -> [Spider]

What I think the authors want to tell me when calling it multi-modal:

  "Photo of spider" -> "Characteristics of a spider" -> [Spider]
  "Photo of word spider" -> "Letters S-P-I-D-E-R" -> "Written word spider" -> [Spider]

I guess there are two ways of looking at that question.

One is just basic generalisation: do these neurons effectively capture things within their semantic group (whatever that means) but completely outside of the training data? If yes, then I guess the answer might be 'yes, in some sense it is like a "real abstraction"'.

Second, and (afaik) currently a more philosophical framing: it isn't obvious whether "(sufficiently advanced) OR logic" and "real abstraction" are actually different. Additionally, for the purposes of a model like this one, I find it hard to see how they could be different. The best the model can do is (roughly speaking) assign neurons to particular concepts, be they ones that fit with our mental models of the world, or ones that are more "functional". The better a job it can do of the former, the more we might be inclined to believe that it is modelling things as we understand them.

A long time ago I implemented a Neocognitron. As far as I understand, most deep learning models are very similar in architecture. My Neocognitron could recognize 1 and 0 (grey scale). I handcrafted the weights and hierarchy (as you do with a normal beginner Neocognitron). When you wrote a 1 into the hole of the 0, or multiple 0s next to a 1, you could see how it sometimes weighted the 1 higher and sometimes the 0. It's not very far-fetched to add an "OR" layer that can combine patterns of 1 and 0. Yet I wouldn't call this layer multi-modal. I guess that in the case of CLIP, due to the training method and the output alignment, this kind of OR layer simply occurs in some cases.
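
To make the "OR" layer idea concrete, here is a minimal Python sketch. It is not a Neocognitron and not how CLIP works: it assumes toy 3x3 binary templates in place of handcrafted S/C-cell hierarchies, and the names (`match`, `or_layer`, `TEMPLATES`) and the 0.9 threshold are made up for illustration. It only shows that a unit which fires when any one of several unrelated detectors fires is a plain aggregation, with no shared concept behind it.

```python
# Toy illustration only: templates, names and threshold are hypothetical.

ZERO = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]

ONE = [[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]]

TEMPLATES = {"zero": ZERO, "one": ONE}


def match(image, template):
    """Fraction of pixels on which a binary image agrees with a template."""
    total = sum(len(row) for row in template)
    agree = sum(
        p == t
        for image_row, template_row in zip(image, template)
        for p, t in zip(image_row, template_row)
    )
    return agree / total


def or_layer(image, templates=TEMPLATES, threshold=0.9):
    """Fire if ANY detector matches well enough (an OR implemented as a max).

    Returns (fired, per-detector scores).
    """
    scores = {name: match(image, t) for name, t in templates.items()}
    return max(scores.values()) >= threshold, scores


if __name__ == "__main__":
    blank = [[0, 0, 0] for _ in range(3)]
    for name, image in [("zero", ZERO), ("one", ONE), ("blank", blank)]:
        fired, scores = or_layer(image)
        print(name, fired, scores)
```

The unit fires for both the 0 and the 1 pattern, yet nothing in it relates the two patterns to each other; it just takes the maximum over separate detectors.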

When I talked about OR logic I meant a simple and direct aggregation of all possible pixel soups that should be labeled in a specific way. Calling this multimodal and comparing it to the Jennifer Aniston neuron is framing the situation in a way that is not good for science (see AI winter). Especially when multimodal neurons in neuroscience refer to processing more than one sense.

I find it hard to imagine a different solution than sufficiently advanced OR logic for real abstraction within the current ANN models. That does not mean there is none within ANNs and especially not in the brain. We are far from being able to do anything like the brain does and don't know how it learns new concepts so fast - without thousands of sample images. Just because we observe an effect that reminds us of some vague findings in the brain and we call the effect the same in ANNs does not make ANNs more like brains.

BTW nice podcast that also discusses the Jennifer Aniston neuron with its "discoverer": https://soundcloud.com/neuropodcast/episode62