by fwlr 1111 days ago
I am sympathetic to this view (I would really love to see just how safe it’s possible to get), but I think the Musk/Karpathy-style argument for vision-only self-driving is quite strong, and it only seems flawed because it has been incorrectly simplified as “humans do driving with ~only vision -> computers should do driving with only vision”.

The proper argument is “humans do driving with ~only vision -> roads are therefore universally designed and built to be driven by vision -> computers should do driving with only vision”. It is essentially a standards-based argument: since vision is the universal standard for driving, computers must be able to drive using just vision.

So vision is always going to be the core of self-driving. Why not augment with LIDAR anyway?

Well, in situations where vision and LIDAR are both right, you didn’t need LIDAR; in situations where vision is right and LIDAR is wrong, you didn’t need LIDAR and it potentially made you worse off; in situations where vision is wrong and LIDAR is right, you need to spend more on improving your vision; and in situations where both vision and LIDAR are wrong, you need to spend more on improving both, but improving vision is a higher priority. These are all the possible outcomes and none of them make a compelling case for investing in LIDAR.


> I think the Musk/Karpathy-style argument for vision-only self-driving is quite strong

> humans do driving with ~only vision -> roads are therefore universally designed and built to be driven by vision -> computers should do driving with only vision

What is 'should'? Is it a moral imperative? Is it a social obligation? Who made this argument, a Catholic priest?

Where is the consideration of this argument from an engineering perspective: the analysis of advantages and disadvantages, the cost-benefit consideration? Where is the assessment that, for example, 50% of human crashes are due to poor visibility or spatial awareness, and a comparison of how well a computer handles them?

If I posted this vacuous, unsupported argument here, I would be laughed at, and rightly so.

But if Elon announces something, there is always 10% of the population willing to defend it, no matter how dumb it is.

I don't get it. Do you have any experience in designing navigation systems? In stereo vision systems? In computer vision? Or is this just a "Musk bad therefore idea he has is bad" counterreaction to what he said?

Karpathy is one of the world's top self driving engineers. This isn't a vacuous argument. People are driving with just vision every single day. The part we're missing is the ChatGPT moment on the computational side.

I have produced 3D maps with lidars, drone mounted near-infrared cameras and with thermal infrared cameras.

You can tell apart grass and green carpet with a simple formula. You can count trees without machine learning. You can detect which plants are wilting, and tell land that is wet from land that is dry. All of that is easy with the right sensors, because they have more data than an RGB camera can produce.
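One formula of this kind, offered as an illustration rather than as the exact one meant above, is the normalized difference vegetation index (NDVI), which compares near-infrared and red reflectance. Living vegetation reflects strongly in the near-infrared, so grass scores high while a green carpet does not. The reflectance values and the 0.4 threshold below are made-up placeholders, not measurements.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index: (NIR - red) / (NIR + red)."""
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treat as non-vegetation
    return (nir - red) / total


def looks_like_vegetation(nir: float, red: float, threshold: float = 0.4) -> bool:
    # The threshold is an illustrative assumption; real cutoffs depend on
    # the sensor, the lighting and the scene.
    return ndvi(nir, red) > threshold


# Hypothetical reflectances (0..1) for two surfaces that look equally green in RGB.
grass = {"nir": 0.50, "red": 0.05}
carpet = {"nir": 0.10, "red": 0.08}

print(looks_like_vegetation(**grass))   # True
print(looks_like_vegetation(**carpet))  # False
```

The point is that the decision is made per pixel from two band values, with no object classification involved.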

I know people that work with multispectral imagery; they can tell you that pixel N45 has a specific substance - concrete, steel or wood - just from spectra alone. They don't need to know what pixels around it are showing, or classify objects.

Agreed, I have a similar background with both LiDAR and vision for 3D reconstruction and mapping systems, plus I've designed some fairly impactful commercial multispectral software which is now widely used in the agricultural space. And vision can give you perfectly sufficient data to build world models and to localise yourself rapidly and robustly. What I believe is missing on the Tesla side is primarily the navigation and 'social interaction' component of driving.

It's not like Waymo dropped a LiDAR onto the roofs of their vehicles and started driving unsupervised in traffic the next day. Nor Cruise, nor Uber. The sensing is just a small part of the whole system.

"It is difficult to get a man to understand something, when his salary depends on his not understanding it" seems apt here regarding both Karpathy and Elon. You can call me skeptical, but when there are millions and billions of dollars on the line for the two respectively, I don't know if I believe in Karpathy's expertise (which is in AI, not self-driving per se) and personal integrity sufficiently to believe he is doing what he considers to be the right thing vs. putting profit ahead of human lives.

Do you have any expertise in self-driving, remote sensing, computer vision, navigation systems or anything else on this topic? Do you have a Tesla with the FSD package and participate in the beta program?

From what I've heard firsthand Autopilot still steadily improves, irrespective of what people say about their favourite sensing modalities...

radar is not lidar and is present on lots of vehicles that do L2/L3 driving, except newer Teslas. optical sensors do not inherently tell you distance as a function of their sensing, whereas radar does.
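As a rough illustration of what measuring distance "as a function of their sensing" means: an active sensor that times its own echo can convert that time to range directly, range = c * t / 2. A single camera pixel carries no equivalent quantity, so depth has to be inferred. This is a minimal sketch of simple pulse timing; many automotive radars use frequency-modulated waveforms instead, but they still measure range directly. The function name and the example echo time are illustrative assumptions.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_from_echo(round_trip_s: float) -> float:
    """Distance to a target from the round-trip time of an emitted pulse."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # The pulse travels out and back, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# Hypothetical echo arriving 400 nanoseconds after the pulse was sent.
print(round(range_from_echo(400e-9), 2))  # 59.96 (metres)
```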

a vision only approach _may_ be possible at some time, but only with a strong computational model of the human brain and thought process.

also, most people drive poorly— i wouldn’t say vision is the be-all-end-all of autonomous driving. it’s also clear that waymo and cruise have taken a full sensor based approach and are successful, whereas tesla is not.

I originally had “radar/LIDAR” everywhere you see “LIDAR” in that comment but it got really unwieldy halfway through. I think what I said generalizes from the specific example of LIDAR to other forms of sensing pretty well anyway, so you can just sub in radar if you want. The general principle is “vision” (in the sense of cameras feeding 2D image data into something that is probably a neural network) vs “everything else”. I would have said cameras vs sensors but some of the sensors use the visible light spectrum and so their sensors are called cameras. I like your use of “optical”, that might be the cleanest way to point at what I meant.

I broadly agree with your second point, about vision-only presenting big computational challenges. I think you do get some easy wins that bring down the challenge a bit - e.g. you don’t need to model human brains, you just need to model whatever the brain is doing when it’s driving; also the fact that we can teach people to drive without understanding what their brain is doing is a reassurance that we can teach a neural network to drive without understanding what it is doing either, so it frees us from (some) of the modeling of thought processes as well. But it is still a big computational challenge. I heard that Tesla has a server farm with thousands of Nvidia A100s; if true, that could make a dent in the problem for sure.

And yeah, I also wouldn’t say vision is the be-all and end-all when it comes to driving. (It’s a pity that we can’t easily integrate LiDAR, radar, and other sensors into the human brain so we could use them like we do sight and sound in order to drive better.)

My point is more that roads come in all shapes and types and sizes, but one consistent thing about them is that they’re all designed so that humans can use vision to drive on them. Like, you don’t know if future roads/signs/cars will be built in ways that are hard to read with LiDAR, but you can be pretty confident they won’t be built to be hard to see. Road builders, car makers - everyone else involved in the driving industry is designing for vision. It’s implicit, and it’s aimed at human vision, but it’s one of the few universal constraints on driving.

That’s what I mean when I say it’s a standards-based argument, that vision is sort of a “universal interface” for roads. Another “universal interface” for roads might be wheels (with traction), or more specifically tires. You don’t need to have rubber tires, or even wheels at all, to drive on roads - but if you do have tires, you can be pretty confident that you can drive on pretty much any road you come across.

This is a compelling argument at the surface level (that roads are designed for humans with vision) that quickly breaks down when you examine how Tesla constructs their self-driving system.

Quick disclaimer that this doesn't reflect the views of my employer, nor does any of what I'm saying about self-driving software apply specifically to our system. Rather I am making broad generalizations about robotics systems in general, and about Tesla's system in particular based on their own Autonomy Day presentations.

When you drive on the road as a human, you rely a lot more on intuition and feel than exact measurements. This is exactly the opposite of how a self-driving car works. Modern robotics systems work by detecting every relevant actor in the scene (vehicles, cyclists, pedestrians etc.), measuring their exact size and velocity, predicting their future trajectories, and then making a centimeter-level plan of where to move. And they do all of this tens of times per second. It's this precision that we rely on when we make claims about how AVs are safer drivers than humans. To improve performance in a system like this, you need better, more accurate measurements, better predictions and better plans. Every centimeter of accuracy is important.
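A toy sketch of the measure, predict, plan loop described above, to show where the measurements enter. It assumes constant-velocity prediction and point-sized actors, which is far simpler than any production stack; the `Actor` class, the 2 m clearance and the 0.1 s step are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Actor:
    x: float   # measured position, metres
    y: float
    vx: float  # measured velocity, metres per second
    vy: float


def predict(actor: Actor, t: float) -> tuple[float, float]:
    """Constant-velocity prediction of the actor's position t seconds ahead."""
    return (actor.x + actor.vx * t, actor.y + actor.vy * t)


def min_gap(ego: Actor, other: Actor, horizon_s: float = 3.0, step_s: float = 0.1) -> float:
    """Smallest predicted centre-to-centre distance over the planning horizon."""
    steps = int(round(horizon_s / step_s))
    best = math.inf
    for i in range(steps + 1):
        t = i * step_s
        ex, ey = predict(ego, t)
        ox, oy = predict(other, t)
        best = min(best, math.hypot(ex - ox, ey - oy))
    return best


def plan_is_safe(ego: Actor, others: list[Actor], clearance_m: float = 2.0) -> bool:
    """Accept the ego plan only if every actor stays beyond the clearance."""
    return all(min_gap(ego, other) >= clearance_m for other in others)


ego = Actor(x=0.0, y=0.0, vx=10.0, vy=0.0)
parallel_car = Actor(x=0.0, y=3.5, vx=10.0, vy=0.0)   # adjacent lane, same speed
crossing_bike = Actor(x=20.0, y=-3.0, vx=0.0, vy=1.5)  # crosses the ego path

print(plan_is_safe(ego, [parallel_car]))   # True
print(plan_is_safe(ego, [crossing_bike]))  # False
```

Every input to `min_gap` is a measured position or velocity, so an error in those measurements feeds directly into the safety decision.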

By contrast, when you drive as a human it really is as simple as "images in, steering angle out". You just eyeball (pun intended) the rest. At no point in time can you look at the car in the lane next to you and tell its exact dimensions or velocity.

Now perhaps with millions of Nvidia A100s we could try to get to a system that's just "images in, steering angle out" but so far that has proven to be a pipe dream. The best research in the area doesn't even begin to approach the performance that we're able to get with our more classical robotics stack described above, and even Tesla isn't trying to end-to-end learn it all.

That isn't to say it's impossible (obviously, humans do it) but I think one could make a strong argument that "images in, steering angle out" is like epsilon close to just solving the problem of AGI, and perhaps even a million A100s wouldn't cut it ;)

That's not really true. Humans, at critical moments, do make implicit and even explicit plans of movement and follow them. We don't use literal velocity measurements for other objects, true, but in making those plans we do sometimes anticipate their locations at various points in the future, which is really what matters.

The best human drivers do this not at the centimeter, but at the millimeter level. Look at downhill (motor)bike racing, Formula 1, WRC, etc. These drivers can execute millimeter-level maneuvers that are planned well in advance at over 100 km/h.

Yeah that's kind of what I was trying to say. You're right in that we predict the actions of others, but we don't do it in the same way. Even when we execute millimeter level maneuvers, we aren't explicitly measuring anything... Like if you were to ask a driver for instructions on how to repeat that maneuver they wouldn't be able to tell you, they just have a "feel" for it.

Basically humans are really really good at guesstimating with great accuracy (but poor reproducibility) and since we don't use basic measurements in the first place, having better measurement accuracy wouldn't really help us be better drivers on average (it does help for certain scenarios like parking though, where knowing the # of inches remaining to an obstacle can be very useful).

But for everyday driving at speed, we wouldn't be able to process measurements in real time even if someone was providing them to us. AVs are different, and that's basically the gist of what I was trying to say. Because they actually do use, rely on, and process measurements in real time, improving their measurement accuracy (i.e. switching from camera-based approximate depth to cm-level accurate depth from a LiDAR) can have a meaningful impact on the final performance of the system.
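One way to see why camera-based depth is only approximate, using the textbook stereo model rather than anything specific to a real vehicle: depth comes from disparity, Z = f * B / d, so a fixed disparity error produces a depth error that grows with the square of the distance. The focal length, baseline and quarter-pixel disparity error below are illustrative assumptions, and learned monocular depth does not follow this simple model.

```python
def stereo_depth_error(
    depth_m: float,
    focal_px: float,
    baseline_m: float,
    disparity_err_px: float,
) -> float:
    """First-order depth uncertainty for a stereo pair: Z^2 / (f * B) * delta_d."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


# Hypothetical rig: 1000 px focal length, 0.3 m baseline, 0.25 px matching error.
for depth in (10.0, 50.0, 100.0):
    err = stereo_depth_error(depth, focal_px=1000.0, baseline_m=0.3, disparity_err_px=0.25)
    print(f"{depth:5.0f} m -> +/- {err:.2f} m")
```

Under these assumed numbers the uncertainty is under 10 cm at 10 m but several metres at 100 m, which is the gap a direct range measurement is meant to close.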