|
|
|
|
|
by raindear
211 days ago
|
|
There has been a slow progress in computer vision in the last ~5 years. We are still not close to human performance. This is in contrast to language understanding which has been solved - LLMs understand text on a human level (even if they have other limitations). But vision isn't solved. Foundation models struggle to segment some objects, they don't generalize to domains such as scientific images, etc. I wonder what's missing with models. We have enough data in videos. Is it compute? Is the task not informative enough? Do we need agency in 3D? |
|
A few examples I encountered recently: If I take a picture of my living room many random object would be impossible to identify by a stranger but easy by the household members. Or when driving, say at night I see a big dark shape coming from the side of the road? If I’m a local I’ll know there are horses in that field and it is fenced, or I might have read a warning sign before that’ll make me able to deduce what I’m seeing a few minutes later.
People are usually not conscious about this but you can try to block the additional informations to only see and process only what’s really coming from your eyes, and realize how soon it gets insufficient.