| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by m3at 2542 days ago

We don't just have 2D data though.

We learn objects representations by interacting with them over years in a multi modal fashion. Take for example a simple drinking glass: we know its material properties (it is transparent, solid, can hold liquids), its typical position (stay on a tabletop, upright with the open side on top), its usage (grab it with a hand and bring to mouth)...

We also make heavy use of the time dimension, as over a few seconds we see the same objects from different view points and possibly in different states.

Only after learning what a glass is can we easily recover its properties on a still 2D image.

So at least for learning (might be skippable at inference), it makes a lot of sense to me to have more than 2D still images.

1 comments

ebg13 2542 days ago

You're not responding to what they said. The person you're responding to is talking about depth from stereo, not cognition. Lidar _also_ doesn't know what the glass feels like.

link

m3at 2541 days ago

I am, I was not writing about cognition here.

All I'm saying is that even with stereo inputs, we're doing more than computing depth from the baseline between left/right images. Close one eye and you can still estimate relative objects positions, because you learned that roads are mostly planar and cars don't float but stand on the road. You know what the expected size of a car is compared to, say, a human, and if the car is visually smaller than the human, it must be more far away.

Lidar _also_ doesn't know what the glass feels like.

Yes I agree with you, lidar and most current vision sensors also suffer from this.

link

semi-extrinsic 2541 days ago

People who have good vision in one eye can usually get their drivers licence without problems. So the depth from stereo is not a necessary part of driving for humans.

link

ebg13 2541 days ago

It doesn't matter how you estimate depth, but you do have to estimate it to drive, and the first step before you can estimate is that your eyes (eye in your example) need to see pictures. Light entering the eye is an entirely different stage in the process than reasoning about said light.

link