
A long time ago I implemented a Neocognitron. As far as I understand, most deep learning models are very similar in architecture. My Neocognitron could recognize 1 and 0 (grey scale). I handcrafted the weights and hierarchy (as you do with a normal beginner Neocognitron). When you wrote a 1 into the hole of a 0, or multiple 0s next to a 1, you could see how it sometimes weighted the 1 higher and sometimes the 0. It's not very far-fetched to add an "OR" layer that can combine patterns of 1 and 0. Yet I wouldn't call this layer multimodal. I guess that in the case of CLIP, due to the training method and the output alignment, this kind of OR layer simply emerges in some cases.

When I talked about OR logic, I meant a simple and direct aggregation of all possible pixel soups that should be labeled in a specific way. Calling this multimodal and comparing it to the Jennifer Aniston neuron frames the situation in a way that is not good for science (see AI winter), especially since multimodal neurons in neuroscience refer to neurons that process more than one sense.
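To make the "OR logic" idea concrete, here is a toy sketch in Python. It is not the Neocognitron described above and not how CLIP works internally; the 3x3 binary templates, the agreement score, and the names `match` and `or_unit` are all made up for illustration. It only shows a unit that fires whenever any one of several unrelated pixel patterns matches.

```python
# Toy sketch of an "OR" unit over handcrafted pattern detectors.
# Assumptions: binary 3x3 images and made-up templates; the original
# model used grey-scale input and a proper Neocognitron hierarchy.

ONE = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
ZERO = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]


def match(image, template):
    """Return the fraction of pixels where image and template agree."""
    agree = 0
    total = 0
    for img_row, tpl_row in zip(image, template):
        for pixel, expected in zip(img_row, tpl_row):
            agree += int(pixel == expected)
            total += 1
    return agree / total


def or_unit(image, templates):
    """Soft OR: respond as strongly as the best-matching template."""
    return max(match(image, t) for t in templates)


# The unit responds fully to either pattern, although the two
# patterns share almost no pixels with each other.
response_to_one = or_unit(ONE, [ONE, ZERO])
response_to_zero = or_unit(ZERO, [ONE, ZERO])
overlap = match(ONE, ZERO)
```

The unit ends up "recognizing" both inputs only because both templates were wired into it. Nothing in it represents what the two patterns have in common, which is the sense in which it is an aggregation rather than an abstraction.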

I find it hard to imagine a solution other than sufficiently advanced OR logic for real abstraction within the current ANN models. That does not mean there is none within ANNs, and it certainly does not mean there is none in the brain. We are far from being able to do anything like what the brain does, and we don't know how it learns new concepts so fast, without thousands of sample images. Just because we observe an effect that reminds us of some vague findings in the brain, and give the effect the same name in ANNs, does not make ANNs more like brains.

BTW nice podcast that also discusses the Jennifer Aniston neuron with its "discoverer": https://soundcloud.com/neuropodcast/episode62