by hedora 1113 days ago
Monocular cameras are a strange strawman. Is anyone seriously considering them?

Binocular cameras provide absolute depth information, and the sensors are an order of magnitude cheaper than the other options.

Since this technology is clearly computationally limited, you should subtract the budget for the sensors from the budget for the computation.

According to the article, the non-camera sensors are in the $1000’s per car range, so the question becomes whether a camera system with an extra $2000 of custom asic / gpu / tpu compute is safer than a computationally-lighter system with a higher bandwidth sensor feed.

I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

So, assuming multi-camera setups really are first to market, the question then is whether the exotic sensors will ever be able to justify their cost (vs the safety win from adding more cameras and making the computer smarter).

4 comments

It is not a strawman. Tesla FSD, in all forms, exclusively uses monocular cameras.

As seen on their website [1], and confirmed numerous times, they have monocular vision around the car. Although there are three front-facing cameras, each has a different focal length and they are located right next to each other, so they cannot operate as binocular vision.

[1] https://www.tesla.com/en_AE/autopilot/

Wow. I had no idea they were so far behind. I guess it should have been obvious from their miles between disengagement stats.
Teslas are so expensive, and they can't install another camera? Why...
At the risk of stating the obvious, stereovision in practice has a few interesting challenges. Yes, the main formula is deceptively simple: d = b*f / D (d - depth, D - disparity, b - baseline, f - focal length), but in practice all 3 terms on the right require some thinking. The most difficult is D, the disparity. It usually comes from some sort of feature matching algorithm, whether traditional or ML-based. Such algorithms usually require textured surfaces to work properly, so if the surface does not have "enough" texture (an example would be a gray truck in front of the cameras), the feature matching will work poorly. In CV research, other simplifying assumptions are made so that epipolar constraints make the task easier. Examples of these assumptions are coplanar image planes, epipolar lines being parallel to the line connecting the focal points, and so on. In practice, these assumptions are usually wrong, so you need, for example, to rectify the images, which is an interesting task by itself. Additionally, the baseline b can drift due to changes in temperature and mechanical vibrations. So can the focal length f, so automatic camera calibration is required (not trivial).

Don't forget some interesting scenarios like dust particles or mud on one of the cameras (or windshield if cameras are located behind the windshield) or rain beading and distorting the image thus breaking the feature matcher and resulting disparity estimates.

Next, to "see" further, a stereo rig needs to have a decent baseline. For example, in the classic KITTI dataset, the baseline is approximately 0.54m, which is much larger than, for example, that of human eyes (0.065m). Such a baseline, 54cm, together with the focal length, which, if I remember correctly, is about 720px in the case of the KITTI vehicle cameras, would give about 389m in the ideal case of being able to detect 1 pixel of disparity. But detecting 1px of D is very difficult in practice - don't forget you will be running your algo on a car with limited compute resources. Say you can resolve around 5px of D; that means a max depth of around 78m - comparable to older Velodyne LiDARs.
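The range arithmetic above can be checked with a few lines of Python. This is only a sketch of the ideal rectified-stereo formula d = b*f / D; the baseline and focal length are the figures quoted in the comment (the focal length was given from memory), and the function name is made up for illustration.

```python
def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Ideal rectified stereo: depth = baseline * focal length / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / disparity_px


# Figures quoted in the comment, not verified calibration data.
KITTI_BASELINE_M = 0.54
KITTI_FOCAL_PX = 720.0

for min_disparity_px in (1, 5):
    max_depth = depth_from_disparity(KITTI_BASELINE_M, KITTI_FOCAL_PX, min_disparity_px)
    print(f"smallest usable disparity {min_disparity_px}px -> max depth {max_depth:.1f}m")
```

The smallest disparity the matcher can reliably resolve sets the maximum usable depth, which is why the 1px and 5px cases differ by a factor of five.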

Some of the issues I mentioned are not specific to stereovision (e.g. you need to calibrate monocular cameras as well and so on), just wanted to point out that stereovision does not magically enable depth perception. The solution would likely be a combination of monocular and stereo cameras, combined with SfM (Structure from Motion) and depth-from-stereo algorithms.

Isn't binocular information only useful for objects 10m ahead or closer? At least according to Hacker News, the most reliable source of information on the internet: https://news.ycombinator.com/item?id=36182151
This paper suggests that human vision maintains stereopsis much further out than many researchers have thought: “Binocular depth discrimination and estimation beyond interaction space” https://jov.arvojournals.org/article.aspx?articleid=2122030

They measured out to 18m & point out that the typical measured limits of angular resolution of the human eye mean that we could extract stereo image information out to 200m or more.

This paper claims to demonstrate stereopsis out to 250m, which is roughly the limit you’d expect from typical human visual acuity: “Stereoscopic perception of real depths at large distances” https://jov.arvojournals.org/article.aspx?articleid=2191614

This paper suggests that stereopsis occurs out to somewhere between 20m & 65m before other cues dominate 3D depth perception: “The Role of Binocular Cues in Human Pilot Landing Control” https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30...

It seems that the claim that stereo vision only occurs in the near field case is probably wrong? Human stereo vision is much more capable than that, and if it reaches out to significantly > 20m, it is surely being used when driving?

I think binocular depth resolution is roughly proportional to the space between the cameras. A car hood is much wider than a human head. I’m not sure how far you can push that without hitting issues with close up stuff.
Depends on the angular resolution of the sensor as well.
Yeah; proportional, assuming the sensor stays constant and hand waving about the FOV of the lens.
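A rough way to put numbers on this exchange, assuming the ideal stereo relation depth = b*f / D: differentiating with respect to D gives a first-order depth error of about depth^2 * dD / (b*f). So for a fixed disparity error, the depth uncertainty shrinks in proportion to both the baseline and the focal length in pixels (which stands in for the sensor's angular resolution). The function name and the sample values below are illustrative assumptions only.

```python
def depth_error(depth_m, baseline_m, focal_px, disparity_error_px=1.0):
    """First-order depth uncertainty for ideal rectified stereo.

    Derived from depth = baseline * focal / disparity:
    |d(depth)/d(disparity)| = depth**2 / (baseline * focal).
    """
    return depth_m ** 2 * disparity_error_px / (baseline_m * focal_px)


# Illustrative comparison at 50m with a 720px focal length (assumed values):
# a human-eye-sized baseline versus a car-width-scale baseline.
for baseline_m in (0.065, 0.54):
    err = depth_error(50.0, baseline_m, 720.0)
    print(f"baseline {baseline_m}m -> depth error at 50m is roughly {err:.1f}m per px")
```

Note the quadratic growth with distance: doubling the range quadruples the error, so a wider baseline helps most for far-away objects.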
> According to the article, the non-camera sensors are in the $1000’s per car range.... I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

A human life is worth about $10 million (in the US); that's a bit more than the sensor costs. If one in 10,000 camera-only cars causes a death, then it's not economically viable.
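To spell out the arithmetic behind that claim, here is a minimal sketch. It assumes the "one in 10,000" figure means one additional death per 10,000 camera-only cars compared with sensor-equipped cars, which is one reading of the comment; the names are made up for illustration.

```python
VALUE_OF_LIFE_USD = 10_000_000  # figure quoted in the comment


def expected_fatality_cost_per_car(cars_per_extra_death, value_of_life_usd=VALUE_OF_LIFE_USD):
    """Expected cost per car if one extra death occurs per `cars_per_extra_death` cars."""
    return value_of_life_usd / cars_per_extra_death


cost = expected_fatality_cost_per_car(10_000)
print(f"expected fatality cost per camera-only car: ${cost:,.0f}")
```

Under that assumption the expected cost per car lands in the same range as the quoted sensor price, which is the break-even point the comment is gesturing at.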

A London bus costs about $300,000, and it is economically viable. Why is a $1,000 sensor a problem? It is definitely viable to install on buses and trucks. Maybe you need to get out of the mindset of personal cars. It is not a viable business model, and it is not a viable model for dealing with congestion either.

If the car fleet kills half as many people as human drivers with a $1000/car system, and zero people with a $100,000 system, then we should immediately put the $1000 system on 100 times as many cars as we could put the $100,000 system on. (I picked those numbers because that is the order of magnitude range I have heard self driving car companies quote.)
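A quick sketch of that comparison, under a fixed total budget. The budget and the baseline death rate per human-driven car are hypothetical placeholders; only the two system prices and their effectiveness (half the deaths vs. zero deaths) come from the comment.

```python
def lives_saved(budget_usd, cost_per_car_usd, fatality_reduction, baseline_deaths_per_car):
    """Deaths avoided by spending the whole budget on one type of system.

    fatality_reduction is the fraction of human-driver deaths avoided
    (0.5 for the $1000 system, 1.0 for the $100,000 system).
    """
    cars_equipped = budget_usd // cost_per_car_usd
    return cars_equipped * fatality_reduction * baseline_deaths_per_car


BUDGET_USD = 100_000_000          # hypothetical fleet budget
BASELINE_DEATHS_PER_CAR = 1e-4    # hypothetical human-driver rate

cheap = lives_saved(BUDGET_USD, 1_000, 0.5, BASELINE_DEATHS_PER_CAR)
expensive = lives_saved(BUDGET_USD, 100_000, 1.0, BASELINE_DEATHS_PER_CAR)
print(f"cheap system saves {cheap / expensive:.0f}x as many lives for the same money")
```

The ratio does not depend on the placeholder budget or death rate: the cheap system covers 100 times as many cars at half the effectiveness, so it saves about 50 times as many lives per dollar.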

The $100,000 system would only ever make sense if the fleet was already entirely self driving, and money for other life saving stuff like the environment and health care also hit suitable diminishing returns. Of course, by then, the cheap systems will have improved.

This argument holds for any non-negative dollar value you place on human life.

It is also independent of who owns the vehicles. Money the bus fleet spends on expensive self driving pulls money away from bus stop upgrades, pollution controls, etc.