| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eightails 1603 days ago

Sure, but they're not getting that 3d map from binocular vision. The forward camera sensors are within a few mm of each other and different focal lengths.

And the tweet thread you linked confirms it's a ML depth map:

> Well, the cars actually have a depth perceiving net inside indeed.

My speculation was that a binocular system might be less prone to error than the current net.

1 comments

leobg 1603 days ago

Sure. You're suggesting that Tesla could get depth perception by placing two identical cameras several inches apart from each other, with an overlapping field of view.

I'm just wondering if using cameras that are close to each other, but use different focal lengths, doesn't give the same results.

It seems to me that this is how modern phones are doing background removal: The lenses are very close to each other, very unlike the human eye. But they have different focal lengths, so depth can be estimated based on the diff between the images caused by the different focal lengths.

Also, wouldn't turning a multitude of views into a 3D map require a neural net anyway?

Whether the images differ because of different focal lengths or because of different positions seems to be essentially the same training task. In both cases, the model needs to learn "This difference in those two images means this depth".

I think with the human eye, we do the same thing. That's why some optical illusions work that confuse your perception of which objects are in front and which are in the back.

And those illusions work even though humans actually have an advantage over cheap fixed-focus cameras, in that focusing the lens on the object itself gives an indication of the object's distance. Much like you could use a DSL as a measuring device by focusing on the object and then checking the distance markers on the lens' focus ring. Tesla doesn't have that advantage. They have to compare two "flat" images.

link

eightails 1603 days ago

> I'm just wondering if using cameras that are close to each other, but use different focal lengths, doesn't give the same results

I can see why it might seem that way intuitively, but different focal lengths won't give any additional information about depth, just the potential for more detail. If no other parameters change, an increase in focal length is effectively the same as just cropping in from a wider FOV. Other things like depth of field will only change if e.g. the distance between the subject and camera are changed as well.

The additional depth information provided by binocular vision comes from parallax [0].

> Also, wouldn't turning a multitude of views into a 3D map require a neural net anyway?

Not necessarily, you can just use geometry [1]. Stereo vision algorithms have been around since the 80s or earlier [2]. That said, machine learning also works and is probably much faster. Either way the results should in theory be superior to monocular depth perception through ML, since additional information is being provided.

> It seems to me that this is how modern phones are doing background removal: The lenses are very close to each other, very unlike the human eye. But they have different focal lengths, so depth can be estimated based on the diff between the images caused by the different focal lengths.

Like I said, there isn't any difference when changing focal length other than 'zooming'. There's no further depth information to get, except for a tiny parallax difference I suppose.

Emulation of background blur can certainly be done with just one camera through ML, and I assume this is the standard way of doing things although implementations probably vary. Some phones also use time-of-flight sensors, and Google uses a specialised kind of AF photosite to assist their single sensor -- again, taking advantage of parallax [3]. Unfortunately I don't think the Tesla sensors have any such PDAF pixels.

This is also why portrait modes often get small things wrong, and don't blur certain objects (e.g. hair) properly. Obviously such mistakes are acceptable in a phone camera, less so in an autonomous car.

> And those illusions work even though humans actually have an advantage over cheap fixed-focus cameras, in that focusing the lens on the object itself gives an indication of the object's distance

If you're referring to differences in depth of field when comparing a near vs far focus plane, yeah that information certainly can be used to aid depth perception. Panasonic does this with their DFD (depth-from-defocus) system [4]. As you say though, not practical for Tesla cameras.

[0] https://en.wikipedia.org/wiki/Binocular_disparity [1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36... [2] https://www.ri.cmu.edu/pub_files/pub3/lucas_bruce_d_1981_2/l... [3] https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-a... [4] https://www.dpreview.com/articles/0171197083/coming-into-foc...

link

bumby 1603 days ago

>different focal lengths won't give any additional information about depth, just the potential for more detail.

This is also why some people will optimize each eye for different focal length when getting laser eye surgery. When your lens is too stiff from age, it won't provide any additional depth perception but will give you more detail at different distances.

link

leobg 1597 days ago

Wow. Ok. I did not know that. I thought that there is depth information embedded in the diff between the images taken at different focal lengths.

I'm still wondering. As a photographer, you learn that you always want to use a focal length of 50mm+ for portraits. Otherwise, the face will look distorted. And even a non-photographer can often intuitively tell a professional photo from an iPhone selfie. The wider angle of the iPhone selfie lens changes the geometry of the face. It is very subtle. But if you took both images and overlayed them, you see that there are differences.

But, of course, I'm overlooking something here. Because if you take the same portrait at 50mm and with, say, 20mm, it's not just the focal length of the camera that differs. What also differs is the position of each camera. The 50mm camera will be positioned further away from the subject, whereas the 20mm camera has to be positioned much closer to achieve the same "shot".

So while there are differences in the geometry of the picture, these are there not because of the difference in the lenses being used, but because of the difference in the camera-subject distance.

So now I'm wondering, too, why Tesla decided against stereo vision.

It does seem, though, that they are getting that depth information through other means:

Tesla 3D point cloud: https://www.youtube.com/watch?v=YKtCD7F0Ih4

Tesla 3D depth perception: https://twitter.com/sendmcjak/status/1412607475879137280?s=6...

Tesla 3D scene reconstruction: https://twitter.com/tesla/status/1120815737654767616

Perhaps it helps that the vehicle moves? That is, after all, very close to having the same scene photographed by cameras positioned at different distances. Only that Tesla uses the same camera, but has it moving.

Also, among the front-facing cameras, the two outermost are at least a few centimeters apart. I haven't measured it, but it looks like a distance not unlike between a human's eyes [0]. Maybe that's already enough?

[0] https://www.notateslaapp.com/images/news/2022/camera-housing...

link

eightails 1591 days ago

> But, of course, I'm overlooking something here. Because if you take the same portrait at 50mm and with, say, 20mm, it's not just the focal length of the camera that differs. What also differs is the position of each camera. The 50mm camera will be positioned further away from the subject, whereas the 20mm camera has to be positioned much closer to achieve the same "shot".

Yep, totally.

> Perhaps it helps that the vehicle moves? That is, after all, very close to having the same scene photographed by cameras positioned at different distances.

I think you're right, they must be taking advantage of this to get the kind of results they are getting. That point cloud footage is impressive, it's hard to imagine getting that kind of detail and accuracy just from individual 2d stills.

Maybe this also gives some insight into the situations where the system seems to struggle. When moving forward in a straight line, objects in the peripheral will shift noticeably in relative size, position and orientation within the frame, whereas objects directly in front will only change in size, not position or orientation. You can see this effect just by moving your head back and forth.

So it might be that the net has less information to go on when considering objects stationary directly in or slightly adjacent to the vehicles path -- which seems to be one of the scenarios where it makes mistakes in the real world, e.g. with stationary emergency vehicles. I'm just speculating here though.

> Also, among the front-facing cameras, the two outermost are at least a few centimeters apart. I haven't measured it, but it looks like a distance not unlike between a human's eyes [0]. Maybe that's already enough?

Maybe. The distance between the cameras is pretty small from memory, less than in human eyes I would say. It would also only work over a smaller section of the forward view due to the difference in focal length between the cams. I can't help but think that if they really wanted to take advantage of binocular vision, they would have used more optimal hardware. So I guess that implies that the engineers are confident that what they have should be sufficient, one way or another.

link