Hacker News
by Imnimo 1939 days ago
Yeah, I mean it's definitely the case that the signal here is that the same words occur in captions for pictures of text as in captions for pictures of objects. But if you run this same experiment of visualizing dataset examples on other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
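
For anyone unfamiliar with the "visualizing dataset examples" experiment, here is a rough sketch of the idea: score every dataset example by how strongly one unit fires on it, then look at the top few. Everything named here (`top_activating_examples`, `toy_dataset`, `toy_unit`) is hypothetical and made up for illustration; a real run would read activations from an actual network layer rather than a hand-written function.

```python
import heapq

def top_activating_examples(examples, activation_fn, k=3):
    """Return the k examples with the highest activation, highest first."""
    # Pair each activation with its index so ties are broken deterministically.
    scored = [(activation_fn(x), i) for i, x in enumerate(examples)]
    top = heapq.nlargest(k, scored)
    return [(examples[i], score) for score, i in top]

# Hypothetical toy dataset of (kind, description) pairs: some entries stand in
# for photos of objects, others for pictures of rendered text.
toy_dataset = [
    ("photo", "a spider on a leaf"),
    ("text", "the word spider written on paper"),
    ("photo", "a dog in a park"),
    ("text", "the word dog on a sign"),
]

def toy_unit(example):
    """Stand-in for a single unit: fires on the concept 'spider' in either kind."""
    _kind, description = example
    return float(description.split().count("spider"))

if __name__ == "__main__":
    for example, score in top_activating_examples(toy_dataset, toy_unit, k=2):
        print(score, example)
```

In this toy setup the top examples include both a photo and a picture of text, which is the pattern being called "multimodal"; a unit that only ever surfaced one kind would not qualify.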
1 comment

I'm not sure I follow, but isn't the CLIP "multi-modal" neuron considered "multi-modal" only because it sits some layers before the actual output? Maybe the indirection is just obfuscation rather than a sign of abstraction.