| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by raindear 211 days ago
	There has been a slow progress in computer vision in the last ~5 years. We are still not close to human performance. This is in contrast to language understanding which has been solved - LLMs understand text on a human level (even if they have other limitations). But vision isn't solved. Foundation models struggle to segment some objects, they don't generalize to domains such as scientific images, etc. I wonder what's missing with models. We have enough data in videos. Is it compute? Is the task not informative enough? Do we need agency in 3D?

2 comments

tarsinge 211 days ago

I’m not an expert in the field but intuitively from my own experience I’d say what’s missing is a world model. By trying to be more conscious about my own vision I’ve started to notice how common it is that I fail to recognize a shape and then use additional knowledge, context and extrapolations to deduce what it can be.

A few examples I encountered recently: If I take a picture of my living room many random object would be impossible to identify by a stranger but easy by the household members. Or when driving, say at night I see a big dark shape coming from the side of the road? If I’m a local I’ll know there are horses in that field and it is fenced, or I might have read a warning sign before that’ll make me able to deduce what I’m seeing a few minutes later.

People are usually not conscious about this but you can try to block the additional informations to only see and process only what’s really coming from your eyes, and realize how soon it gets insufficient.

link

patates 210 days ago

> If I take a picture of my living room many random object would be impossible to identify by a stranger but easy by the household members.

Uneducated question so may sound silly: A sufficiently complex vision model must have seen a million living rooms and random objects there to make some good guesses, no?

link

parineum 211 days ago

> LLMs understand text on a human level (even if they have other limitations).

Limitations like understanding...

link

visioninmyblood 211 days ago

The problem is the data. LLM data is self supervised. Vision data is very sparsly annotated in the real world. Going a step further robotics data is is much sparser. So getting these models to improve on this long tail distribution will take time.

link