by sendtown_expwy
1938 days ago
If the text encoder's representations of "spiderman" and "spider" are close together, then the image encoder is penalized if it places the corresponding image representations far apart. So what we are witnessing here could simply be the neuron that says "I think 'spider' appears in the caption". Of course, it could be that the network understands the metaphorical relationship of spiderman to spiders, but the simpler explanation seems more plausible to me.
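
To make the penalty argument concrete, here is a rough sketch. It assumes a training objective that pulls each image embedding toward its own caption embedding, and it only computes that positive-pair term, not a full contrastive loss. The 2-D vectors and the `positive_pair_penalty` helper are made up for illustration; they are not taken from any real model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_pair_penalty(images, texts):
    """Mean (1 - cosine) between each image and its own caption.

    Hypothetical stand-in for the "pull matching pairs together" part of a
    contrastive objective. Lower is better.
    """
    gaps = [1.0 - cosine(i, t) for i, t in zip(images, texts)]
    return sum(gaps) / len(gaps)


# Made-up caption embeddings: "spider" and "spiderman" are nearly parallel.
texts = [[1.0, 0.0], [0.995, 0.1]]

# Case A: the image encoder keeps the two images close together.
images_close = [[1.0, 0.0], [0.995, 0.1]]

# Case B: the image encoder puts the two images far apart (orthogonal).
images_far = [[1.0, 0.0], [0.0, 1.0]]

penalty_close = positive_pair_penalty(images_close, texts)
penalty_far = positive_pair_penalty(images_far, texts)

print(f"images close: {penalty_close:.3f}")
print(f"images far:   {penalty_far:.3f}")
```

With the captions nearly identical, the second image in case B cannot sit near its own caption while also being far from the first image, so the penalty goes up. That is all the "simpler explanation" needs: no understanding of the metaphor, just image embeddings following caption embeddings.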