| Sure. You're suggesting that Tesla could get depth perception by placing two identical cameras several inches apart from each other, with an overlapping field of view. I'm just wondering if using cameras that are close to each other, but use different focal lengths, doesn't give the same results. It seems to me that this is how modern phones are doing background removal: The lenses are very close to each other, very unlike the human eye. But they have different focal lengths, so depth can be estimated based on the diff between the images caused by the different focal lengths. Also, wouldn't turning a multitude of views into a 3D map require a neural net anyway? Whether the images differ because of different focal lengths or because of different positions seems to be essentially the same training task. In both cases, the model needs to learn "This difference in those two images means this depth". I think with the human eye, we do the same thing. That's why some optical illusions work that confuse your perception of which objects are in front and which are in the back. And those illusions work even though humans actually have an advantage over cheap fixed-focus cameras, in that focusing the lens on the object itself gives an indication of the object's distance. Much like you could use a DSLR as a measuring device by focusing on the object and then checking the distance markers on the lens' focus ring. Tesla doesn't have that advantage. They have to compare two "flat" images. |
I can see why it might seem that way intuitively, but different focal lengths won't give any additional information about depth, just the potential for more detail. If no other parameters change, an increase in focal length is effectively the same as just cropping in from a wider FOV. Other things like depth of field will only change if e.g. the distance between the subject and camera is changed as well.
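To make the "it's just cropping" point concrete, here's a rough sketch using an idealised pinhole camera model (my assumption; real lenses add distortion, and the scene points and focal lengths below are made-up numbers). Doubling the focal length scales every image coordinate by the same factor regardless of how far away the point is, so comparing the two images tells you nothing about depth.

```python
import numpy as np

def project(points, focal_length):
    """Idealised pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z)."""
    points = np.asarray(points, dtype=float)
    return focal_length * points[:, :2] / points[:, 2:3]

# Hypothetical scene: three points at very different depths (metres).
scene = np.array([
    [1.0, 0.5, 2.0],
    [1.0, 0.5, 10.0],
    [-2.0, 1.0, 50.0],
])

# Same camera position, two hypothetical focal lengths (arbitrary units).
wide = project(scene, focal_length=4.0)
tele = project(scene, focal_length=8.0)

# Per-point magnification between the two images.
magnification = np.linalg.norm(tele, axis=1) / np.linalg.norm(wide, axis=1)
print(magnification)  # the same factor for every point, near or far
```

The magnification is identical for the point at 2 m and the point at 50 m, which is exactly what you'd get by cropping and upscaling the wide image.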
The additional depth information provided by binocular vision comes from parallax [0].
> Also, wouldn't turning a multitude of views into a 3D map require a neural net anyway?
Not necessarily, you can just use geometry [1]. Stereo vision algorithms have been around since the 80s or earlier [2]. That said, machine learning also works and is probably much faster. Either way the results should in theory be superior to monocular depth perception through ML, since additional information is being provided.
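As a minimal sketch of the "just geometry" part, assuming two identical, perfectly aligned (rectified) pinhole cameras and assuming you already know which pixel in the left image matches which pixel in the right: depth is focal length times baseline divided by disparity. The focal length in pixels, the baseline and the scene points are all hypothetical values, and finding the matches (the hard part in practice) is skipped here.

```python
import numpy as np

def project_x(points, focal_px, camera_x=0.0):
    """Horizontal pinhole image coordinate (pixels) for a camera at (camera_x, 0, 0)."""
    points = np.asarray(points, dtype=float)
    return focal_px * (points[:, 0] - camera_x) / points[:, 2]

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth for a rectified stereo pair: Z = f * B / d."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    if np.any(disparity_px <= 0):
        raise ValueError("disparity must be positive for points in front of both cameras")
    return focal_px * baseline_m / disparity_px

FOCAL_PX = 1000.0   # hypothetical focal length, in pixels
BASELINE_M = 0.2    # hypothetical distance between the two cameras, in metres

# Hypothetical scene points (X, Y, Z) in metres.
points = np.array([
    [0.5, 0.0, 2.0],
    [0.5, 0.0, 10.0],
    [-1.0, 0.0, 50.0],
])

left = project_x(points, FOCAL_PX, camera_x=0.0)
right = project_x(points, FOCAL_PX, camera_x=BASELINE_M)
disparity = left - right  # horizontal shift between the two views, in pixels

recovered_depth = depth_from_disparity(disparity, FOCAL_PX, BASELINE_M)
print(recovered_depth)
```

No learning is involved once the correspondences are known; the recovered depths match the Z values that went in.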
> It seems to me that this is how modern phones are doing background removal: The lenses are very close to each other, very unlike the human eye. But they have different focal lengths, so depth can be estimated based on the diff between the images caused by the different focal lengths.
Like I said, there isn't any difference when changing focal length other than 'zooming'. There's no further depth information to get, except for a tiny parallax difference I suppose.
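To put a rough number on how little that tiny parallax buys you: differentiating Z = f * B / d gives a depth error of roughly Z^2 * (disparity error) / (f * B), so the error grows with the square of the distance and shrinks with the baseline. The sketch below uses made-up values (a 1 cm phone-style lens spacing versus a 20 cm spacing, a 1000 px focal length, and a quarter-pixel matching error), so treat the outputs as illustrative only.

```python
import math

def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=0.25):
    """First-order depth uncertainty for rectified stereo: dZ ~ Z^2 * dd / (f * B)."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)

FOCAL_PX = 1000.0       # hypothetical focal length, in pixels
NARROW_BASELINE = 0.01  # hypothetical phone-style lens spacing, in metres
WIDE_BASELINE = 0.20    # hypothetical wider camera spacing, in metres

for depth in (1.0, 10.0, 50.0):
    narrow = depth_error(depth, FOCAL_PX, NARROW_BASELINE)
    wide = depth_error(depth, FOCAL_PX, WIDE_BASELINE)
    print(f"{depth:5.1f} m: narrow baseline ~{narrow:.3f} m, wide baseline ~{wide:.3f} m")

# The ratio of the two errors is just the ratio of the baselines.
error_ratio = depth_error(50.0, FOCAL_PX, NARROW_BASELINE) / depth_error(50.0, FOCAL_PX, WIDE_BASELINE)
```

At portrait distances the narrow baseline is workable; at driving distances the same setup gets very coarse, which is the case for spacing the cameras further apart.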
Emulation of background blur can certainly be done with just one camera through ML, and I assume this is the standard way of doing things although implementations probably vary. Some phones also use time-of-flight sensors, and Google uses a specialised kind of AF photosite to assist their single sensor -- again, taking advantage of parallax [3]. Unfortunately I don't think the Tesla sensors have any such PDAF pixels.
This is also why portrait modes often get small things wrong, and don't blur certain objects (e.g. hair) properly. Obviously such mistakes are acceptable in a phone camera, less so in an autonomous car.
> And those illusions work even though humans actually have an advantage over cheap fixed-focus cameras, in that focusing the lens on the object itself gives an indication of the object's distance
If you're referring to differences in depth of field when comparing a near vs far focus plane, yeah that information certainly can be used to aid depth perception. Panasonic does this with their DFD (depth-from-defocus) system [4]. As you say though, not practical for Tesla cameras.
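For the focus-ring idea specifically, the relationship is the thin-lens equation, 1/f = 1/d_object + 1/d_image. A rough sketch, assuming an ideal thin lens (real lenses are multi-element, and this is not how Panasonic's DFD works internally); the 50 mm focal length and 51 mm lens-to-sensor distance are hypothetical:

```python
import math

def object_distance_mm(focal_length_mm, image_distance_mm):
    """Thin-lens equation solved for the in-focus object distance.

    1/f = 1/d_object + 1/d_image  ->  d_object = f * d_image / (d_image - f)
    """
    if image_distance_mm < focal_length_mm:
        raise ValueError("lens-to-sensor distance must be at least the focal length")
    if image_distance_mm == focal_length_mm:
        return math.inf  # focused at infinity
    return focal_length_mm * image_distance_mm / (image_distance_mm - focal_length_mm)

# Hypothetical 50 mm lens racked out 1 mm past its infinity position.
distance = object_distance_mm(50.0, 51.0)
print(f"in-focus subject is about {distance / 1000:.2f} m away")
```

A fixed-focus camera never changes the lens-to-sensor distance, so there's no such reading to take.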
[0] https://en.wikipedia.org/wiki/Binocular_disparity
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36...
[2] https://www.ri.cmu.edu/pub_files/pub3/lucas_bruce_d_1981_2/l...
[3] https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-a...
[4] https://www.dpreview.com/articles/0171197083/coming-into-foc...