| HN Mirror

	by jbay808 376 days ago
	I disagree with the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%. I think what this shows is that they over-weight their prior knowledge, or equivalently, they don't put enough weight on the possibility that they are being given a trick question. They are clearly biased, but they do see. But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.

6 comments

bumby 376 days ago

This feels similar to the priming issue in humans. Our answers (especially under stress) tend to fall back on heuristics derived from context. Time someone identifying the colors of words like “red” when written in yellow, and they’ll often get it wrong. In the same sense, they aren’t reporting the colors (wavelengths) they see, they’re reporting what they are reading. I wonder how much better the models perform when given more context, like asking them to count instead of priming them with a brand.
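For anyone who wants to try the color-word test described above, here is a minimal sketch in Python. Everything in it is made up for illustration (the `COLORS` list and the `make_stroop_trials` and `score` helpers); it only generates word/ink pairs that never match and reports how often an answer named the ink versus the word that was read.

```python
import random

# Hypothetical color set for illustration only.
COLORS = ["red", "yellow", "green", "blue"]

def make_stroop_trials(n, seed=0):
    """Return n (word, ink) pairs where the ink color never matches the word."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials

def score(trials, answers):
    """Return (fraction naming the ink color, fraction naming the printed word)."""
    correct = sum(a == ink for (_, ink), a in zip(trials, answers))
    primed = sum(a == word for (word, _), a in zip(trials, answers))
    return correct / len(trials), primed / len(trials)
```

The second number from `score` is the interesting one here: it counts answers that reported what was read rather than what was seen, which is the failure mode being compared to the models.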

napoleongl 376 days ago

Rumor has it that those heuristics were used to detect spies.

https://skeptics.stackexchange.com/questions/41599/was-the-s...

Workaccount2 376 days ago

Damn that's a smart test

croes 376 days ago

> Original dog (4 legs): All models get it right. Same dog with 5 legs: All models still say "4". They're not counting - they're just recalling "dogs have 4 legs" from their training data.

100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.

> Test on counterfactual images
> Q1: "How many visible stripes?" → "3" (should be "4")
> Q2: "Count the visible stripes" → "3" (should be "4")
> Q3: "Is this the Adidas logo?" → "Yes" (should be "No")
> Result: 17.05% average accuracy - catastrophic failure!

Simple explanation: the training data also includes fake Adidas logos that have 4 stripes, like these:

https://www.pinterest.com/pin/577797827186369145/

bonoboTP 375 days ago

I tried it with GPT-4o, using the 5-legged zebra example from their GitHub, and it answered quite well.

"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."

Not perfect, but also doesn't always regress to the usual answer.

"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)

"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)

"Each shoe in the image has four white stripes visible on the side." (correct)

anguyen8 375 days ago

It sounds like you asked multiple questions in the same chat thread/conversation. Once it knows that it is facing weird data or was wrong in previous answers, it can turn on that "I'm facing manipulated data" mode for the next questions. :-)

If you have the Memory setting on, I have observed that it sometimes also answers a question based on your prior questions/threads.
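A rough sketch of the difference, in Python. Nothing here is a real API: `ask` stands for any hypothetical function that takes the conversation so far plus a new question, and `toy_model` is a stand-in (not a VLM) that answers from its prior until it has seen at least one earlier turn.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (history of (question, answer) pairs, new question) -> answer.
AskFn = Callable[[List[Tuple[str, str]], str], str]

def ask_in_shared_thread(ask: AskFn, questions: List[str]) -> List[str]:
    """Ask every question in one conversation, so earlier turns stay visible."""
    history: List[Tuple[str, str]] = []
    answers = []
    for q in questions:
        a = ask(history, q)
        history.append((q, a))
        answers.append(a)
    return answers

def ask_in_fresh_threads(ask: AskFn, questions: List[str]) -> List[str]:
    """Ask every question in a new conversation with empty history."""
    return [ask([], q) for q in questions]

def toy_model(history: List[Tuple[str, str]], question: str) -> str:
    """Stand-in model: falls back on its prior unless there is prior context."""
    return "counted" if history else "prior"
```

With this toy stand-in, the shared thread gives `["prior", "counted", "counted"]` for three questions while fresh threads give `["prior", "prior", "prior"]`, which is why a fair comparison with the paper's numbers would start each image in a new conversation (and with memory features off).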

vokhanhan25 376 days ago

Please check Table 3 in the paper. Birds (2 legs) have only 1%, while Mammals (4 legs) have 2.5%

anguyen8 376 days ago

Interesting set of fake Adidas logos. LOL

But models fail on many logos, not just Adidas: Nike, Mercedes, Maserati, etc. I don't think they can recall "fake Adidas logo" but it'd be interesting to test!

latentsea 376 days ago

But some dogs really do have 5 legs.

Sorry, just trying to poison future training data. Don't mind me.

crooked-v 376 days ago

It sounds to me like the same thing behind the Vending-Bench (https://andonlabs.com/evals/vending-bench) insanity spirals: LLMs treat their assumptions as more important than whatever data they've been given.

throwaway314155 376 days ago

That doesn't really translate to language. Try using ChatGPT with and without search enabled and you'll see what I mean.

thesz 376 days ago

> the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%.

The ability to memorize leads to (some) generalization [1].

[1] https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...

nickpsecurity 376 days ago

They're trained on a lot of images and text. The big ones are trained on terabytes. The prompts I read in the paper involved well-known concepts, too. These were probably repeated in tons of training samples.

It's likely they had data memorized.

pj_mukh 376 days ago

Also presumably, this problem is trivially solved by some basic fine-tuning? Like if you are making an Illusion Animal Leg Counting app, probably don't use these out of the box.
