If it's deep in the network, though, couldn't you interpret it as a single neuron responsible for the substring "spider" in the caption? That seems less surprising. But I see your point.
Yeah, I mean it's definitely the case that the signal here is that the same words occur in captions for pictures of text as for pictures of objects. But if you do this same experiment of visualizing dataset examples for other networks (like ones trained on ImageNet), you don't find this sort of multimodal neuron.
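To make "visualizing dataset examples" concrete, here is a rough sketch of the idea: rank dataset examples by how strongly a single neuron fires on them and look at the top few. The function name, example ids, and activation values are all made up for illustration; a real run would record activations from the actual network.

```python
import heapq


def top_activating_examples(activations, k=3):
    """Return the k (example_id, activation) pairs with the highest activation.

    `activations` is a hypothetical mapping from an example id to the
    activation of one neuron on that example.
    """
    return heapq.nlargest(k, activations.items(), key=lambda item: item[1])


# Toy, invented activations for one hypothetical neuron.
toy_activations = {
    "photo_of_spider": 4.1,
    "image_of_word_spider": 3.7,
    "spiderman_drawing": 3.9,
    "photo_of_dog": 0.2,
    "image_of_word_table": 0.1,
}

top = top_activating_examples(toy_activations, k=3)
for example_id, value in top:
    print(example_id, value)
```

If the top examples for one neuron mix photos, drawings, and rendered text of the same concept, that is the multimodal pattern being discussed; the claim above is that you don't see that mix when you run this on networks trained on ImageNet.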
I'm not sure I get it, but isn't the CLIP "multi-modal" neuron considered "multi-modal" only because it occurs some layers before the actual output? I'm not sure, but maybe the indirection is just obfuscation and not a sign of abstraction.