| As someone who works in the industry (disclaimer: these are my own views and don't reflect those of my employer), something about the framing of this article rubs me the wrong way, despite the fact that it's mostly on point. Yes, it is true that different companies are choosing different sensing solutions based on cost and the ODD (operational design domain) in which they must operate. But the last sentence left a sour taste in my mouth: "But the verdict is still out as to which is safer". It is not an open question, and I hate it when writers frame it this way. Camera-only (specifically, monocular-camera-only) systems literally cannot be safer than ones with sensor fusion right now. This may change at some point in the future, but right now it's not a question, it's a fact. Setting aside comparisons to humans for a second (I'll get back to this), monocular cameras can only provide relative depth. You can guess the absolute depth with your neural net, but the estimates are pretty garbage. Unfortunately, robots can't work or plan with this input. A typical robotics stack relies on an absolute, measured understanding of the world in order to make its plans. That isn't to say we'll never be able to use mono (relative) depth; with sufficiently powerful ML and better representations, one day we might. People argue that humans don't really use stereoscopic depth past ~10m or so, and that's a fair point. But we also don't plan the way robots do. We don't require accurate measurements of distance and size. When you're squeezing your car into a parking spot, you don't measure your car and then measure the spot to know if it'll fit. You just know. You just do it. And it's a guesstimate (so sometimes humans make mistakes and hit stuff). Robots don't work this way (for now), so their sensors cannot work this way either (for now). |
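To make the "relative depth only" point concrete, here is a minimal sketch of the scale ambiguity in a single camera image. It assumes an idealized pinhole camera; the `project` and `scale_scene` helpers and the 1000 px focal length are made up for illustration and don't come from any real stack.

```python
def project(point, focal_px=1000.0):
    """Project a camera-frame point (x, y, z) in meters to pixel offsets
    from the image center, using an idealized pinhole model."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


def scale_scene(point, k):
    """Scale a point's distance from the camera (and its lateral offsets) by k."""
    return tuple(k * c for c in point)


# A point 10 m ahead, and the same ray extended to 30 m.
near_point = (0.5, 0.25, 10.0)
far_point = scale_scene(near_point, 3.0)

pixel_near = project(near_point)
pixel_far = project(far_point)

# Both points land on the same pixel, so one image alone cannot tell
# a small, close object from a large, distant one.
print(pixel_near, pixel_far)
```

Any absolute depth a mono system reports therefore has to come from learned priors (typical object sizes, ground-plane assumptions) rather than from measurement, which is why a second sensor or a known baseline is needed to pin down the scale.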
From how humans drive, it's pretty clear that there exists some latent-space representation of our immediate surroundings inside our brains that doesn't require a lot of data. If you had a driving sim wheel, 4 monitors (one for each direction) plus 3 smaller ones for the rear-view mirrors, all connected to a real-world car with sufficiently high-definition cameras, you could probably drive the car remotely as well as you could in real life, because the images would map to the same latent space.
But the advantage humans have is an innate understanding of basic physics, built from experience interacting with the world, which we can apply to something as simple as a 2D representation. That understanding is a big part of the latent space. You wouldn't be able to drive a car if you didn't have some "understanding" of things like velocity, acceleration, object collision, etc.
So my bet is that, just like with LLMs, research will be published at some point showing a model that, given certain frames of a video, can extrapolate the physical interactions that will occur, including things like collisions, relative distances, and so on. Once that is in place, self-driving systems will get MASSIVELY better.