by jbay808
376 days ago
I disagree with the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%. I think what this shows is that they over-weight their prior knowledge, or equivalently, they don't put enough weight on the possibility that they are being given a trick question. They are clearly biased, but they do see.

But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.