Hacker News
by ikhatri 1112 days ago
As someone who works in the industry (disclaimer these are my own views and don't reflect those of my employer), something about the framing of this article rubs me the wrong way despite the fact that it's mostly on point. Yes it is true that different companies are choosing different sensing solutions based on cost and the ODD in which they must operate. But I think this last sentence just left a sour taste in my mouth "But the verdict is still out as to which is safer".

It is not an open question and I hate it when writers frame it this way. Camera only (specifically, monocular camera only) systems literally cannot be safer than ones with sensor fusion right now. This may change at some point in the future, but right now it's not a question, it's a fact.

Setting aside comparisons to humans for a second (will get back to this), monocular cameras can only provide relative depth. You can guess the absolute depth with your neural net but the estimates are pretty garbage. Unfortunately, robots can't work/plan with this input. The way any typical robotics stack works is that it relies on an absolute/measured understanding of the world in order to make its plans.

That isn't to say that one day with sufficiently powerful ML and better representations we would be totally unable to use mono (relative) depth. People argue that humans don't really use our stereoscopic depth past ~10m or so and that's a fair point. But we also don't plan the way robots do. We don't require accurate measurements of distance and size. When you're squeezing your car into a parking spot you don't measure your car and then measure the spot to know if it'll fit. You just know. You just do it. And it's a guesstimate (so sometimes humans make mistakes and we hit stuff). Robots don't work this way (for now), so their sensors cannot work this way either (for now).

12 comments

Self driving isn't a sensor problem, it's a software problem.

From how humans drive, it's pretty clear that there exists some latent space representation of immediate surroundings inside our brains that doesn't require a lot of data. If you had a driving sim wheel and 4 monitors for each direction + 3 smaller ones for rear view mirror, connected to a real world car with sufficiently high definition cameras, you could probably drive the car remotely as well as you could in real life, all because the images would map to the same latent space.

But the advantage that humans have is that we have an innate understanding of basic physics from experience in interacting with the world, which we can deduce from something as simple as a 2d representation, and that is very much a big part of that latent space. You wouldn't be able to drive a car if you didn't have some "understanding" of things like velocity, acceleration, object collision, etc.

So my bet is that just like with LLMs, there will be research published at some point that given certain frames in a video, it will be able to extrapolate the physical interactions that will occur, including things like collision, relative distances, and so on. Once that is in place, self driving systems will get MASSIVELY better.

It's both. Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.

Self-driving is still a robotics problem, and robots are probabilistic operators with many component dependencies. If you have three 99% reliable systems strung together running 24 hours a day, that's about 43 minutes a day that the chain will be unreliable ((1 - 0.99^3) * 1440). Multi-modality allows your systems to provide redundancy for one another and reduce the accumulating correlated errors.
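
For anyone who wants to check the arithmetic above, here is a minimal sketch. The function names are made up for illustration, and the redundancy part assumes the sensors fail independently, which is optimistic given that real errors are often correlated.

```python
def chained_reliability(reliabilities):
    """Probability that every stage of a serial pipeline works at the same time."""
    total = 1.0
    for r in reliabilities:
        total *= r
    return total


def unreliable_minutes_per_day(reliabilities, minutes_per_day=1440):
    """Expected minutes per day during which at least one serial stage is failing."""
    return (1.0 - chained_reliability(reliabilities)) * minutes_per_day


def redundant_reliability(reliabilities):
    """Probability that at least one of several parallel sensors works.

    Assumes independent failures (an assumption, not a claim about real sensors).
    """
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail


# Three 99% stages in series, as in the comment above.
print(round(unreliable_minutes_per_day([0.99, 0.99, 0.99]), 1))  # 42.8
# Two independent 99% sensors covering for each other.
print(round(redundant_reliability([0.99, 0.99]), 6))  # 0.9999
```

Serial stages multiply reliabilities together, while redundant sensors multiply failure probabilities together, which is the whole case for multi-modality in one line.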

> Your eyes have much better dynamic range and FPS than modern self driving systems & cameras.

Eh, kind of...

https://youtu.be/HU6LfXNeQM4?t=1987

Check out this NOVA video on how limited your acute vision actually is. It is only by rapidly moving our eyes around that we have high quality vision. In the places you are not looking your brain is computing what it thinks is happening, not actually watching it.

I should have said eyes+brain in combination have much better dynamic range and FPS perception than self driving systems. Point remains unchanged -- what sensor you use is tied to the computation you need to do. What you see is the sum of computation+sensor so it's impossible for sensor not to matter.

Tangential: event cameras work more like our eyes but aren't ready for AVs yet.

It's only "kind of" if they compensate for the reduced specs. As the root commenter said, they don't compensate yet. It's just less safe in those situations.

Whether it's fine to be less safe in certain situations because it's safer overall is a different question.

> In the places you are not looking your brain is computing what it thinks is happening, not actually watching it.

The existence of peripheral vision disputes that pretty definitively, though.

I do recommend that you stop and watch the video first to understand better what's going on there....
I tried but it's not available in my country, sadly.
> Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.

You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.

> You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.

What? This is preposterous.

Have you tried playing a shooter video game at 30 FPS? It's atrocious, you get rekt. There is a reason all gamers are getting 120 FPS and up.

30 FPS means 33 ms of latency. Driving on a highway, the car moves over a meter before the camera even detects an obstacle. The display has its own input lag, and so does the operating system. Your total latency is going to be over 100ms, so the car will have travelled several meters. If a motorcyclist in front of you falls, you will feel the car crashing into his body before the image even appears on the screen.
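
A quick sanity check of those distances, as a sketch. The 120 km/h highway speed is my assumption (the comment doesn't give one), and `distance_travelled_m` is just an illustrative helper.

```python
def distance_travelled_m(speed_kmh, latency_ms):
    """Distance covered at constant speed during a given latency."""
    speed_mps = speed_kmh / 3.6
    return speed_mps * latency_ms / 1000.0


frame_ms = 1000.0 / 30  # one frame interval at 30 FPS, about 33 ms

print(round(distance_travelled_m(120, frame_ms), 2))  # 1.11
print(round(distance_travelled_m(120, 100), 2))       # 3.33
```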

There's plenty of FPS racing games that you can play just fine at 30FPS. Obviously more FPS is a better experience, but it's not like it becomes impossible to drive.

Also, if you truly are only a few meters behind a motorcyclist when driving at highway speeds, by definition you are being unsafe. The rule I learned in driving school was roughly 1 car length per 10mph of space, so you should be ~90 feet (~30 meters) away.

Finally, the average reaction time for people driving in real life is something like 3/4 of a second. 750ms to transition from accelerating to braking. A self-driving car being able to make decisions in the 100ms time frame is FAR superior.
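
Putting rough numbers on the following-distance rule and the two reaction times, as a sketch only. The 60 mph speed and 15 ft car length are my assumptions, picked because they reproduce the ~90 feet figure; the helper names are hypothetical.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def following_gap_ft(speed_mph, car_length_ft=15.0):
    """Rule of thumb from the comment: one car length per 10 mph."""
    return speed_mph / 10.0 * car_length_ft


def reaction_distance_ft(speed_mph, reaction_s):
    """Distance covered before braking even starts."""
    feet_per_second = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return feet_per_second * reaction_s


gap = following_gap_ft(60)
print(gap)                                      # 90.0
print(round(reaction_distance_ft(60, 0.75), 1))  # 66.0 (human, 750 ms)
print(round(reaction_distance_ft(60, 0.10), 1))  # 8.8  (machine, 100 ms)
```

Under these assumptions a 750 ms reaction eats about 66 of the 90 feet before the brakes are touched, while a 100 ms reaction eats under 9.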

I agree this is preposterous but one nit to pick: event loops on self driving cars are really that slow, and they must use very good behavior prediction + speculative reasoning to deal with scenarios like the one you described.
oh dear
Have you tried doing this in the dark? Have you tried spotting the little arrow in the green traffic light that says you can turn left, consistently, in your video feed even facing a low sun?
Only if that monitor was hooked up to a camera that could dynamically adjust its gain to achieve best possible image contrast in everything from bright sunlight to moonlit night.

You’d also lose depth perception entirely, which can’t be good for your driving.

You can test this pretty easily, it's not like that model doesn't exist. Play your average driving videogame at 30fps in first-person mode. Crank up the brightness until you can barely see if you like. We do it just fine because the model exists in our head, not because there's some inherent perfection in our immediate sensing abilities.
Yeah. I mean you're right and wrong at the same time imo. I won't hypothesize about how humans drive. I think for the most part it's a futile exercise and I'll leave that to the people who have better understanding of neuroscience. (I hate when ML/CS people pretend to be experts at everything).

That being said, this idea of a latent space representation of the world is the right tree to be barking up (imo). The problem with "scale it like an LLM" right now is that 3D scene understanding (currently) requires labels. And LLMs scale the way they do because they don't require labels. They structure the problem as next token prediction and can scale up unsupervised (their state space/vocabulary is also much smaller). And without going into too much detail, myself (and others I know in this field) are actively doing research to resolve these issues so perhaps we really will get there someday.

Until then, however, sensors are king, and anyone selling you "self-driving" without them is lying to you :)

Correction: anyone selling you "self driving" is lying to you.

We're at least a decade away from it... (and yes, I've seen the current batch of FSD videos).

The only one literally selling it, is Mercedes. What is wrong with it? Don’t you consider it „self driving“?

https://media.mbusa.com/releases/mercedes-benz-worlds-first-...

I think you may be over-indexing on the word "selling". I didn't mean it literally as in for sale to you (the customer) directly. That is what Tesla FSD is claiming and I agree with you that we're some indeterminate amount of time away from it.

However Waymo, Cruise and others do exist. If you haven't already, check out JJRicks videos on YouTube. I think you might be changing the number of years in your estimation ;)

Each time I see functional FSD it is in a very specific and limited scope. Simple things like ultra-precise maps, low speed, good roads, suitable climate, and a system that can just bail and stop the car are common themes. I would also be interested to hear if places with Waymo have traffic rules where pedestrians/cyclists have priority without relying on traffic signs.
> if you had a driving sim wheel and 4 monitors for each direction + 3 smaller ones for rear view mirror, connected to a real world car with sufficiently high definition cameras, you could probably drive the car remotely as well as you could in real life, all because the images would map to the same latent space.

I disagree. When in a car, we are using more than our eyes. We have sound as well, of course, something that provides feedback even in the quietest cars. We also have the ability to feel vibration, gravity and acceleration. Sitting in a sim without at least some of these additional forms of feedback would be a different skill.

There was an event where they took the top iRacing sim driver and put him in a real F1 car and he was able to do VERY well in terms of lap times.

There was another event where they took another sim driver and put him in a real drift car, and he was able to drift very well.

Both vids are on youtube. Yes, real world driving has more variables, and yes, the racing drivers had force feedback wheels, but in general, if a person is able to control a car so well as to put the virtual wheel in the right square foot of the virtual track to take a corner optimally, it's likely that most people could drive very well solely from visual feedback. Sound and IMUs can provide additional correctional information, but the key point remains that whatever software runs has to deduce physics from visual images.

Would you say your examples are moving the sim driver from fewer sensors (an abstraction of driving) to more sensors (the real world)?
Driving sims obviously have sound, and also feedback through the steering wheel (sometimes also the seat).

Self driving cars obviously have microphones and accelerometers too.

https://youtu.be/HU6LfXNeQM4

I recommend watching this NOVA video on human perception. When doing any number of tasks, especially ones we do commonly, we're using a ton of unconscious perception and prediction based upon our internal representation of physics and human modeling.

For example when I was younger I was noticing that I was commonly aware that a car was going to get over before it did so. I kept an eye out trying to determine why this was the case and I noticed two things. One is people commonly turn their head and check the mirrors before they even signal to get over. The other is they'll make a slight jerk of the wheel in the direction before making the lane change.

This assertion: "Self driving isn't a sensor problem, it's a software problem." is hard to support today. Your human vision analogy leaves out a lot of both sensor and processing differences between what we call machine vision and human vision.

Even if parity with human vision can be attained, humans kill 42,000 other American humans each year on the roads. If human driven cars were invented today, and pitched as killing only 42,000 people per year, the inventor would get thrown into a special prison for supervillains.

> Self driving isn't a sensor problem, it's a software problem.

Taking things to the extreme, perhaps it’s actually a networking problem.

Cars should have the ability to send signals to, and hear signals from, the cars around them.

Imagine if Car A could improve its own understanding of the environment using inputs/sensor data from nearby Car B.

Not much would change. The idiotic idea of removing traffic lights in favor of self driving cars zipping past each other forgets about those pesky pedestrians we should be designing cities for.
When I wrote the comment, I was envisioning the current world, but with some bluetooth type protocol that cars could use to send beacons to help other cars near it.

The most basic example of how this could be helpful is if the car ahead of you turns a sharp corner and crashes into a truck stopped in the road. Without car-to-car networking, you won't brake until the crash is in your line of sight.

Have you ever seen those youtube videos of massive car pile ups on highways caused by a crash, and then a cascade of additional crashes afterwards? E.g. icy conditions or dense fog. What if the original crash could communicate to cars behind it, wouldn't that be helpful if the crash isn't yet in the driver's (or car's) line of sight?

I agree "not much would change" overnight. It's just another input for the car's software to have at its disposal.

With the current hardware on the roads, I don't think it's technically possible for autos to achieve legitimate self-driving (if that's even the goal anymore?) - there are way too many edge cases that are way too difficult to solve for with just software.

pedestrians, cyclists, skateboarders, and all the other road users that the US car-centric society has determined are "hazards" to driving.
And what happens if there is a child on the road? Or are we going to need implanted transmitter chips in the future, so we can safely go outside and are not overrun by „smart“ cars?

Even if every car is required to be part of the network, there may be badly maintained cars that don’t work properly, or even malicious cars, that send wrong data on purpose.

It would create a better model, but this is not necessary. Cars are already "networked" through things like turn signals and brake lights.
Something more is necessary if "self-driving" is going to actually live up to its name at some point in the future, and I don't think the answer is 100% software.

At this point it's all about edge cases. Certain edge cases are impossible to overcome with just software + cameras alone.

Most humans can drive fairly well in a heavy downpour, solely from the brake lights of the car ahead and occasional glimpses of road markings. That's almost equivalent to a very poor sensor suite.
For this to work, either (1) the network has to be reliable, and all cars have to be trustworthy (both from a security and fault tolerance perspective), or (2) the cars have to be safe even when disconnected from the network, such as during an evacuation.

We already know for sure that we can’t solve (1), which means we have to solve (2). Therefore, car-to-car communication is, at best, a value add, not the enabling technology.

> Imagine if Car A could improve its own understanding of the environment using inputs/sensor data from nearby Car B.

You can't rely on this in real time because urban canyons make it hard to get consistent cell signal (for one thing), but you can definitely improve your models on this data once the data's been uploaded to your offline systems, and some SDC companies do this.

A system of this sort could use some local area networking (think infrared, RF, or even lasers) to create an ad hoc mesh network. It's how I imagine cars in the future to be networked at least.
I'd suggest giving Car Wars by Cory Doctorow a read. https://doctorow.medium.com/car-wars-a01718a27e9e

It involves a situation with networked self driving cars.

A total security nightmare.
Monocular cameras are a strange strawman. Is anyone seriously considering them?

Binocular cameras provide absolute depth information, and are an order of magnitude cheaper sensors than the other options.

Since this technology is clearly computationally limited, you should subtract the budget for the sensors from the budget for the computation.

According to the article, the non-camera sensors are in the $1000’s per car range, so the question becomes whether a camera system with an extra $2000 of custom asic / gpu / tpu compute is safer than a computationally-lighter system with a higher bandwidth sensor feed.

I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

So, assuming multi-camera setups really are first to market, the question then is whether the exotic sensors will ever be able to justify their cost (vs the safety win from adding more cameras and making the computer smarter).

It is not a strawman. Tesla FSD, in all forms, exclusively uses monocular cameras.

As seen on their website [1], and confirmed numerous times, they have monocular vision around the car and, though having three front-facing cameras, they each have different focal lengths and are located next to each other and thus can not operate as binocular vision.

[1] https://www.tesla.com/en_AE/autopilot/

Wow. I had no idea they were so far behind. I guess it should have been obvious from their miles between disengagement stats.
Tesla is so expensive and they can't install another camera. why...
At the risk of stating the obvious, stereovision in practice has a few interesting challenges. Yes, the main formula is deceptively simple: d = b*f / D (d - depth, D - disparity, b - baseline, f - focal length), but in practice, all 3 terms on the right require some thinking. The most difficult is D - disparity, which usually comes from some sort of feature matching algorithm, whether traditional or ML-based. Such algorithms usually require textured surfaces to work properly, so if the surface does not have "enough" texture (an example would be a gray truck in front of the cameras), then the feature matching will work poorly. In CV research there are other simplifying assumptions being made so that epipolar constraints make the task simpler. Examples of these assumptions are coplanar image planes, epipolar lines being parallel to a line connecting focal points and so on. In practice, these assumptions are usually wrong, so you need, for example, to rectify the images, which is an interesting task by itself. Additionally, baseline b can drift due to changes in temperature and mechanical vibrations. So can the focal length f, so automatic camera calibration is required (not trivial).

Don't forget some interesting scenarios like dust particles or mud on one of the cameras (or windshield if cameras are located behind the windshield) or rain beading and distorting the image thus breaking the feature matcher and resulting disparity estimates.

Next, to "see" further, a stereo rig needs to have a decent baseline. For example, in the classic KITTI dataset, the baseline is approximately 0.54m, which is much larger than, for example, human eyes (0.065m). Such a baseline, 54cm, together with focal length, which, if I remember correctly, is about 720px in case of KITTI vehicle cameras, would give about 388m in the ideal case of being able to detect 1 pixel disparity. But detecting 1px of D is very difficult in practice - don't forget you will be running your algo on a car with limited compute resources. Say, you can have around 5px of D, that means max depth of around 77m - comparable to older Velodyne LiDARs.
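
The numbers above drop straight out of d = b*f / D. Here is a small sketch that reproduces them; `stereo_depth_m` is an illustrative name, and the baseline and focal length are the approximate KITTI values quoted in the comment, not something I have checked against the dataset's calibration files.

```python
def stereo_depth_m(baseline_m, focal_px, disparity_px):
    """Depth from the ideal rectified-stereo relation d = b * f / D."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / disparity_px


BASELINE_M = 0.54   # approximate KITTI baseline quoted above
FOCAL_PX = 720.0    # approximate focal length quoted above

print(round(stereo_depth_m(BASELINE_M, FOCAL_PX, 1), 1))  # 388.8
print(round(stereo_depth_m(BASELINE_M, FOCAL_PX, 5), 1))  # 77.8
```

This is the idealised case only: it assumes perfectly rectified images and an exact disparity, which is precisely what the comment says you don't get in practice.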

Some of the issues I mentioned are not specific to stereovision (e.g. you need to calibrate monocular cameras as well and so on), just wanted to point out that stereovision does not magically enable depth perception. The solution would likely be a combination of monocular and stereo cameras, combined with SfM (Structure from Motion) and depth-from-stereo algorithms.

Isn't binocular information only useful for objects 10m ahead or closer? At least according to Hacker News, the most reliable source of information on the internet: https://news.ycombinator.com/item?id=36182151
This paper suggests that human vision maintains stereopsis much further out than many researchers have thought: “Binocular depth discrimination and estimation beyond interaction space” https://jov.arvojournals.org/article.aspx?articleid=2122030

They measured out to 18m & point out that the typical measured limits of angular resolution of the human eye mean that we could extract stereo image information out to 200m or more.

This paper claims to demonstrate stereopsis out to 250m, which is roughly the limit you’d expect from typical human visual acuity: “Stereoscopic perception of real depths at large distances” https://jov.arvojournals.org/article.aspx?articleid=2191614

This paper suggests that stereopsis occurs out to somewhere between 20m & 65m before other cues dominate 3D depth perception: “The Role of Binocular Cues in Human Pilot Landing Control” https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30...

It seems that the claim that stereo vision only occurs in the near field case is probably wrong? Human stereo vision is much more capable than that & if it reaches out to significantly > 20m is surely being used when driving?

I think binocular depth resolution is roughly proportional to the space between the cameras. A car hood is much wider than a human head. I’m not sure how far you can push that without hitting issues with close up stuff.
Depends on the angular resolution of the sensor as well.
Yeah; proportional, assuming the sensor stays constant and hand waving about the FOV of the lens.
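
To make the "proportional" part concrete: differentiating the d = b*f / D formula quoted earlier in the thread gives a depth error of roughly d^2 / (b*f) per pixel of disparity error. A sketch under that idealised model, with a hypothetical helper name and an assumed 1 px matching error:

```python
def depth_error_m(depth_m, baseline_m, focal_px, disparity_error_px=1.0):
    """First-order depth uncertainty of an ideal stereo rig.

    From d = b * f / D, a small disparity error dD maps to a depth error of
    about d**2 / (b * f) * dD.
    """
    return depth_m ** 2 / (baseline_m * focal_px) * disparity_error_px


# Same 720 px focal length, object at 50 m, two different baselines.
print(round(depth_error_m(50, 0.54, 720), 1))   # 6.4  (car-width rig)
print(round(depth_error_m(50, 0.065, 720), 1))  # 53.4 (eye-width rig)
```

The error scales inversely with both baseline and focal length in pixels (the latter being where sensor resolution and lens FOV come in), and it grows with the square of distance.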
> According to the article, the non-camera sensors are in the $1000’s per car range.... I’m guessing camera systems will be the safest economically-viable option, at least until the compute price drops to under a few hundred dollars.

A human life is valued at about $10 million (in the US), which is a bit more than the sensor costs. If one in 10,000 camera-only cars causes a death (an expected cost of $1,000 per car, the same as the sensor), then it's not economically viable.

A London bus costs about $300,000 and is economically viable. Why is a $1,000 sensor a problem? It is definitely viable to install on buses and trucks. Maybe you need to get out of the mindset of personal cars. It is not a viable business model, and it is not a viable model of dealing with congestion either.

If the car fleet kills half as many people as a human with a $1000/car system, and zero people with a $100,000 system, then we should immediately put the $1000 system on 100 times as many cars as we could put the $100,000 system on. (I picked those numbers because that is the order of magnitude range I have heard self driving car companies quote.)

The $100,000 system would only ever make sense if the fleet was already entirely self driving, and money for other life saving stuff like the environment and health care also hit suitable diminishing returns. Of course, by then, the cheap systems will have improved.

This argument holds for any non-negative dollar value you place on human life.

It is also independent of who owns the vehicles. Money the bus fleet spends on expensive self driving pulls money away from bus stop upgrades, pollution controls, etc, etc.
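
Here is the fixed-budget argument as a toy calculation. Everything numeric besides the two system prices and the "half as many / zero" risk figures is a placeholder I made up: the $100M budget and the baseline of 1 death per 10,000 human-driven car-years are purely illustrative.

```python
def expected_lives_saved(budget_usd, cost_per_car_usd, risk_reduction,
                         baseline_deaths_per_car):
    """Expected deaths avoided when a fixed budget is spent on one system type.

    risk_reduction is the fraction of baseline deaths the system prevents
    (0.5 for "half as many as a human", 1.0 for "zero").
    """
    cars_equipped = int(budget_usd // cost_per_car_usd)
    return cars_equipped * risk_reduction * baseline_deaths_per_car


BUDGET = 100_000_000   # hypothetical fleet budget
BASELINE = 1e-4        # hypothetical deaths per human-driven car per year

cheap = expected_lives_saved(BUDGET, 1_000, 0.5, BASELINE)
expensive = expected_lives_saved(BUDGET, 100_000, 1.0, BASELINE)
print(round(cheap, 2), round(expensive, 2))  # 5.0 0.1
```

The 50x ratio between the two doesn't depend on the budget or the baseline rate, only on the price and risk-reduction ratios, which is why the argument doesn't need a dollar value for a life.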

As far as I know humans can also safely drive only with one eye. It’s perfectly legal in most countries.

But I agree that current software (Tesla?) is not able to do that in the same way. So it may need more sensors until the software gets better.

In theory cameras should also be able to see more than humans. They can have a wider angle, higher contrast, higher resolution and better low-light vision than the human eye.

A human with one eye can use slight head and eye movements to gain a sense of depth. Perhaps the mono cameras need some kind of mount that allows them to not only look around but also move in 3 dimensions. That seems more complex than just having binocular cameras, though.
Yup! This kind of reconstruction is known as multi-view reconstruction. Though the cameras don't need to have a movable mount, they're already on a car which moves! The car moves and gives them a new "perspective" at every frame. That's how some monocular systems already work. Here's an example of one such system: https://github.com/nianticlabs/manydepth

That said, I think what you're referring to is more extreme perspectives that shift in ways the car cannot drive and you are correct that this would aid in reconstruction. This is how NeRF models do their 3D reconstruction (https://nerfies.github.io/).

like pigeons.

Can't cameras do this by just comparing frame 1 to frame 2?

Compare how?
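
One simple way to "compare", as a sketch: treat two consecutive frames as a stereo pair whose baseline is the distance the car travelled between them. This toy version only holds for a static point seen by a side-facing camera while the car drives straight with no rotation (all assumptions on my part), and `motion_stereo_depth_m` is a made-up name.

```python
def motion_stereo_depth_m(speed_mps, dt_s, focal_px, pixel_shift_px):
    """Depth of a static point from its image shift between two frames.

    Assumes a side-facing camera translating sideways relative to its
    optical axis, so the two frames behave like a rectified stereo pair
    with baseline = speed * dt.
    """
    if pixel_shift_px <= 0:
        raise ValueError("pixel shift must be positive")
    baseline_m = speed_mps * dt_s
    return baseline_m * focal_px / pixel_shift_px


# 20 m/s, frames 50 ms apart, 720 px focal length, feature moved 36 px.
print(round(motion_stereo_depth_m(20, 0.05, 720, 36), 1))  # 20.0
```

The hard parts this skips are finding the same point in both frames, handling objects that are themselves moving, and the forward-facing case where points near the direction of travel barely shift at all.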
Minor nitpick

> monocular cameras can only provide relative depth

While the environment awareness is nowhere near as good as two or more cameras would be, if you consider the output over time, you get valuable information about the change rate of the environment, i.e. how fast that big thing is getting bigger, which may indicate one should actuate the brakes.
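
The "how fast that big thing is getting bigger" cue can be turned into a number without any absolute depth. A minimal sketch, assuming a rigid object and a constant closing speed; the function name and the example values are hypothetical.

```python
def time_to_contact_s(width_prev_px, width_now_px, dt_s):
    """Time until contact, measured from the latest frame, from image growth alone.

    For a pinhole camera, apparent width is inversely proportional to
    distance, so under constant closing speed:
        ttc = width_prev * dt / (width_now - width_prev)
    Returns infinity if the object is not growing in the image.
    """
    growth_px = width_now_px - width_prev_px
    if growth_px <= 0:
        return float("inf")
    return width_prev_px * dt_s / growth_px


# A 2 m wide object seen with a 720 px focal length, 40 m then 38 m away,
# frames 0.1 s apart (i.e. closing at 20 m/s).
w_prev = 720 * 2.0 / 40.0
w_now = 720 * 2.0 / 38.0
print(round(time_to_contact_s(w_prev, w_now, 0.1), 2))  # 1.9
```

Neither the object's true size nor its distance is needed to get the 1.9 s, which is the point of the nitpick: relative measurements over time still say when to brake.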

Of course, I'm with the crowd that answers the question with a "how many can we have?" question. The more, the merrier. And the more types, the better - give me polarized light and lidar, sonar, radar, thermal, and whatever else that can be plugged in the car's brain to make it better aware of what happens (and correctly guess what's going to happen) outside it.

Can you elaborate on your reasoning? I’m shaky on some of the logic here.

“Monocular” cameras > no “absolute” depth > less safe

The last leap is not well justified.

Also, cars with vision based driving have multiple cameras. What's the difference between a “binocular” camera and two “monocular” cameras?

How does a “binocular” camera get better depth information?

Is using multiple cameras to drive sensor fusion?

Why is absolute depth a strict safety win? How do you know how the sensor details translate to the final safety of the full system?

If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

If humans with only one eye are able to drive, why wouldn’t mono surround vision be at least as good as that?

I am just a hobbyist but I can answer some of these.

> What's the difference between a “binocular” camera and two “monocular” cameras?

For the camera itself, nothing. They are probably referring to the implementation. You can have two cameras side by side, but unless you are using the known geometry between the two views (stereo matching) to estimate depth from the two images, your setup is monocular.

> How does a “binocular” camera get better depth information?

A pixel in two images (with known separation) will have a geometric relationship that can be used to extract depth information. This is a lot faster than alternative methods with a single camera and multiple images.

> Is using multiple cameras to drive sensor fusion?

This is really just a question of semantics.

> Why is absolute depth a strict safety win?

Why is it better to have two eyes than one? You can be more certain about what you are seeing.

> If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

If you had a system with infinite compute you could probably do enough math to calculate absolute depth with 100% certainty. I believe you can already extract absolute depth with something called bundle adjustment-- but it requires multiple images since you are relying on parallax effects. It is also computationally expensive.

> If humans with only one eye are able to drive, why wouldn’t mono surround vision be at least as good as that?

Computers are not humans.

> Why is absolute depth a strict safety win? How do you know how the sensor details translate to the final safety of the full system?

If you can get reliable depth information, the algorithm needed to avoid hitting stationary and slow-moving objects is extremely simple.

Is the stationary object in our path, of nontrivial size, and about to enter our minimum stopping distance? If yes, do we have a swerve planned that will let us safely avoid it? If no, emergency stop.

Because this logic is simple and well defined you can audit the implementation to the high standards applied to things like aircraft autopilot systems.

And it'll work even if the stationary object is something that didn't appear in your training data - you know the algorithm will work the same even if that concrete barrier is painted with some cheery flowers, or if that fire truck is airport yellow instead of the normal red.
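
The decision rule described above really is only a few lines. This is a sketch of that logic, not anyone's production code: the `Obstacle` type, the thresholds, the 7 m/s^2 braking deceleration and the 0.1 s system latency are all placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass
class Obstacle:
    distance_m: float   # measured range to the object
    width_m: float      # measured size of the object
    in_path: bool       # whether it overlaps the planned path


def stopping_distance_m(speed_mps, decel_mps2=7.0, latency_s=0.1):
    """Distance covered during system latency plus braking to a halt."""
    return speed_mps * latency_s + speed_mps ** 2 / (2.0 * decel_mps2)


def decide(obstacle, speed_mps, swerve_available,
           min_width_m=0.2, margin_m=5.0):
    """Return 'continue', 'swerve' or 'emergency_stop'."""
    if not obstacle.in_path or obstacle.width_m < min_width_m:
        return "continue"
    if obstacle.distance_m > stopping_distance_m(speed_mps) + margin_m:
        return "continue"  # not yet close to the minimum stopping distance
    return "swerve" if swerve_available else "emergency_stop"


barrier = Obstacle(distance_m=70.0, width_m=2.5, in_path=True)
print(decide(barrier, speed_mps=30.0, swerve_available=False))  # emergency_stop
```

Note that nothing in `decide` looks at what the object is, only where it is and how big, which is why it behaves the same for a flowery barrier or a yellow fire truck.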

Of course, this relies on the assumption you can get reliable depth information. If your depth sensor gets confused by a cloud of dust while driving in the desert, or gets blinded by the light of the setting sun, or is unable to detect a barbed wire fence, things are no longer quite so simple....

> If this is just a handwavey upper bound on safety, how do you know that such a system can’t be safe enough for its design goals?

Personally I would say that in freeway driving, a self-driving car should be able to avoid 100% of collisions with clearly visible stationary objects in dry, well lit conditions when all system components are in normal working condition.

"But humans can do it with one eye closed and"

... and I want to grab the guy who says that by the collar and scream in their face "The whole point is to build something that can do better than a human."

> People argue that humans don't really use our stereoscopic depth past ~10m or so and that's a fair point.

The second paper I reference in this comment https://news.ycombinator.com/edit?id=36232198 claims that humans can maintain stereopsis out to 250m.

That’s a huge difference from 10m & if true suggests that human drivers might well use 3D vision when driving.

I think I said this already in one of my comments but I'm not a neuroscientist and I don't claim to be. That's why I think it's kind of pointless and silly for me (or any other engineer) to sit here and make arguments about what humans do and don't do in their brains.

IMO it's better for us to focus on what the robots can and cannot do right now, and focus on solving those problems :)

Thanks for the sources though, those papers are definitely neat and I'll be taking a look when I get a chance.

I am implementing Monocular vSLAM as a side project right now. I am working with some optimization libraries like GTSAM but having some issues. Do you know any good resources for troubleshooting this kind of stuff?

It's pretty easy to see, even as someone with very little experience, the benefits of stereo vision over monocular. In addition to the depth stuff it's a lot easier/faster to create your point clouds from disparity maps.

Adding in the car speed and direction information to the monocular camera images gets you an absolute/measured understanding of the world.
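
Roughly how that works, as a sketch: a monocular reconstruction gives you camera motion and depths only up to an unknown scale factor, and the odometer tells you how far the car really moved between frames, which pins the scale down. The function names are hypothetical, and this assumes the speed reading is accurate and the scene is static.

```python
import math


def metric_scale(estimated_translation, speed_mps, dt_s):
    """Scale factor that maps an up-to-scale reconstruction to meters.

    estimated_translation is the camera translation between two frames in
    the reconstruction's arbitrary units; speed * dt is the true distance.
    """
    norm = math.sqrt(sum(c * c for c in estimated_translation))
    if norm == 0:
        raise ValueError("no camera motion, scale is unobservable")
    return speed_mps * dt_s / norm


def rescale_depths(relative_depths, scale):
    """Convert up-to-scale depths to metric depths."""
    return [d * scale for d in relative_depths]


# Reconstruction says the camera moved 1 unit; odometry says 20 m/s for 0.1 s.
scale = metric_scale((0.0, 0.0, 1.0), speed_mps=20.0, dt_s=0.1)
print(rescale_depths([5.0, 10.0], scale))
```

When the car is stopped there is no baseline at all, so the scale can't be recovered from that pair of frames, which is one practical gap compared with a fixed stereo rig.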
> You can guess the absolute depth with your neural net but the estimates are pretty garbage.

I'm not sure what kind of systems you're referring to with "monocular cameras", but if you look at the visualization in a Tesla with FSD Beta, it's actually really good at detecting the position of everything. And that's with pretty bad cameras and not a lot of compute.

Only rarely will you see Tesla's FSD mess up because of perception; the vast majority of the time it messes up, it's just the software being dumb with planning.

Let’s say you are driving down the street in a suburban neighborhood. You see a kid throw a ball into the street. You see from how his body moved that it is a lightweight ball and that it doesn’t require drastic (or any) measures to avoid. Or you see that it is a very heavy object and requires evasive maneuvers.

How exactly does a certain type of sensor help with this? Isn’t the problem entirely based on a software model of the world?

> Setting aside comparisons to humans for a second (will get back to this), monocular cameras can only provide relative depth. You can guess the absolute depth with your neural net but the estimates are pretty garbage.

Stereoscopic vision in humans only works for nearby objects. The divergence for far away objects is not sufficient for this. You may think you can tell something is 50 or 55 meters away through stereoscopic vision, but you can't. That's your brain estimating based on effectively a single image.

That said reality is not a single image, it's a moving image, a video. Monocular video can still be used to estimate object distance in motion.

Eventually AI will be good enough to work better than humans with just a camera. The problem is we're not there yet, and what Tesla is doing is irresponsible. They should've added LIDAR and used that to train their camera-only models, until they're ready to take over.