Hacker News
by tgog 2499 days ago
This completely neglects the fact that humans can build near perfect 3D representations of the world with 2D images stitched together with the parallax neural nets in our brain. This blogpost briefly mentions it in one line as a throwaway and says you'd need extremely high resolution cameras?? Doesn't make sense at all. Two cameras of any resolution spaced a regular distance apart should be able to build a better parallax 3D model than any one camera alone.
13 comments

The first thing we need to remember is that self-driving systems don't work like our brain. If they did, we wouldn't need to train them with billions of images. So the main problem is not just building the 3d models. For example, we don't crash into a car because we have never seen that car model or that kind of vehicle before. Check https://cdn.technologyreview.com/i/images/bikeedgecasepredic... we never think that there is a bike in front of us.

Humans do a lot more than just identifying an image or doing 3d reconstruction. We have context about the roads, we constantly predict the movement of other cars, we know how to react based on the situation and, most importantly, we are not fooled by simple image occlusions. Essentially we have a gigantic correlation engine that makes decisions based on comprehending the different things happening on the road.

The AI algorithms we teach do not work the same way we do. They depend too heavily on identifying the image. Lidar provides another signal to the system. It provides redundancy and allows the system to make the right decision. Take the above linked image for an example.

We may not need a lidar once the technology matures but at this stage it is a pretty important redundant system.

> So the main problem is not just building the 3d models

That's not relevant when discussing which technology to use to build the 3d models. Everything you said is accurate until the last few sentences. Lidar provides the same information (line of sight depth) as stereo cameras, just in a different way. The person you're responding to is talking about depth from stereo, not cognition.

> Lidar provides the same information (line of sight depth) as stereo cameras, just in a different way.

This is incorrect; the amount of parallax you need to get the same kind of accurate depth using cameras is infeasible. Velodynes and other common lidars now get you points accurate at 150m+. Cameras can't do that, and if you use nets to guess you'll still make mistakes.
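
To put rough numbers on the parallax point: for an ideal rectified stereo pair, depth is Z = f * B / d, so a small disparity error turns into a depth error that grows with Z squared. The sketch below is only an illustration. The focal length, baseline and matching error are hypothetical values chosen for the example, not figures from the article or from any real camera rig.

```python
def stereo_depth_error(depth_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth uncertainty for an ideal rectified stereo pair.

    From Z = f * B / d, a disparity error dd maps to dZ ~= Z**2 / (f * B) * dd.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


def baseline_for_error(depth_m, focal_px, disparity_err_px, target_err_m):
    """Baseline needed to keep the depth error at target_err_m at depth_m."""
    return depth_m ** 2 * disparity_err_px / (focal_px * target_err_m)


if __name__ == "__main__":
    # Hypothetical camera: all three values are assumptions for illustration.
    FOCAL_PX = 1000.0
    BASELINE_M = 0.3
    DISPARITY_ERR_PX = 0.25

    for z in (10.0, 50.0, 150.0):
        err = stereo_depth_error(z, FOCAL_PX, BASELINE_M, DISPARITY_ERR_PX)
        print(f"{z:6.1f} m -> +/- {err:.2f} m")

    needed = baseline_for_error(150.0, FOCAL_PX, DISPARITY_ERR_PX, 0.1)
    print(f"baseline for 10 cm error at 150 m: {needed:.2f} m")
```

Under these assumed numbers the uncertainty at 150 m is about 19 m, and holding it to 10 cm would take a baseline of roughly 56 m. A longer focal length or better sub-pixel matching shrinks both figures, so treat this as the shape of the argument rather than a verdict.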

> The person you're responding to is talking about depth from stereo, not cognition.

You miss the point; saying human 3D reconstruction works because of sensors without world context is naive. The response was trying to capture that; human perception systems utilize context / background knowledge extensively.

> the amount of parallax you need to get the same kind of accurate depth using cameras is infeasible. Velodynes and other common lidars now get you points accurate at 150m+

I meant they both just provide line of sight depth.

The point being made by the first comment is that human eyeballs placed one inch apart are currently the gold standard for the actual looking part. So the right set of cameras is by definition sufficient for the looking part of driving. The cameras just have to replace eyes well enough. The brain replacement is farther down the chain.

From the OP:

> humans can build near perfect 3D representations of the world with 2D images stitched together with the parallax neural nets in our brain

This is a statement about cognition. And the response addresses this.

Your response:

> The person you're responding to is talking about depth from stereo, not cognition.

I think this is the disconnect. The person _is_ talking about cognition. OP makes a claim about how humans see, connected to how the human brain works. Response explains why camera-based image recognition right now is a lot worse than your eyes (a big piece of the answer is your brain).

> The cameras just have to replace eyes well enough

So yes this is nice in theory. But I also get the sense most people don't realize just how large the chasm is today between cameras and human eyes. They don't "just provide line of sight depth." Dynamic range, field of view, reliability even under conditions like high heat -- there are many other dimensions where they just aren't analogous yet.

> The first thing we need to remember is that self-driving systems don't work like our brain. If they did, we wouldn't need to train them with billions of images.

I had always assumed that the first few years of infancy were effectively a period of training a neural net (the brain) against a continuous series of images (everything seen).

Where is the bike example from? All these instances of recognition error are meaningless when they don’t come from actual production systems by auto makers. They don’t just slap OpenCV into a car.
Having a redundant system is the key here.

It also provides a reliable source of data. If humans had LiDAR in their system, we would use it to improve our decisions.

I don’t see why we should limit the AV.

The human brain is horrible at building truly accurate 3D representations of the world. Our mental maps are constantly missing a multitude of details while tricking us and creating approximations to fill in the blanks.

Easy examples of this are optical illusions, ghosts, and UFOs. There are also "selective attention tests" where a majority of people miss glaringly obvious events right in front of them when they're focusing on something else. Regular people also tend to bump into things, spill things, and trip, even when going 3 miles an hour (walking speed).

Exactly. We don't build detailed accurate 3D maps. We build fuzzy semantic 2.5-ish-D maps that are 99% metadata. And they work incredibly well.
But at the same time people don't think much about getting in their cars and driving to work or the grocery store.

So it seems that truly accurate 3D representations of the world are not necessary, at least for driving. Perhaps it's the resolution? Looking at the samples in the article, they are just terribly fuzzy, with a narrow field of view. If I had to drive and only see the world through that kind of view, I don't think I would be doing very well.

People also crash all the time. I'd be OK with AI crashing even slightly less than humans. Rabid shock-media and various luddites aren't.
We don't just have 2D data though.

We learn object representations by interacting with them over years in a multimodal fashion. Take for example a simple drinking glass: we know its material properties (it is transparent, solid, can hold liquids), its typical position (it sits on a tabletop, upright with the open side on top), its usage (grab it with a hand and bring it to the mouth)...

We also make heavy use of the time dimension, as over a few seconds we see the same objects from different view points and possibly in different states.

Only after learning what a glass is can we easily recover its properties from a still 2D image.

So at least for learning (might be skippable at inference), it makes a lot of sense to me to have more than 2D still images.

You're not responding to what they said. The person you're responding to is talking about depth from stereo, not cognition. Lidar _also_ doesn't know what the glass feels like.
I am, I was not writing about cognition here.

All I'm saying is that even with stereo inputs, we're doing more than computing depth from the baseline between left/right images. Close one eye and you can still estimate relative object positions, because you learned that roads are mostly planar and cars don't float but stand on the road. You know what the expected size of a car is compared to, say, a human, and if the car is visually smaller than the human, it must be farther away.
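
The size comparison above is essentially the pinhole relation Z = f * H / h: if you know roughly how tall something is, its apparent height tells you how far away it is, with one eye. A minimal sketch, assuming a hypothetical 1000 px focal length and made-up object heights (none of these numbers come from the thread):

```python
def distance_from_known_size(real_height_m, pixel_height, focal_px):
    """Pinhole estimate: an object of height H spanning h pixels is at Z = f * H / h."""
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    return focal_px * real_height_m / pixel_height


if __name__ == "__main__":
    FOCAL_PX = 1000.0  # hypothetical camera focal length in pixels

    # Assumed typical heights: a 1.5 m car and a 1.7 m person.
    car = distance_from_known_size(1.5, pixel_height=50, focal_px=FOCAL_PX)
    person = distance_from_known_size(1.7, pixel_height=170, focal_px=FOCAL_PX)

    # The car appears smaller than the person, so it comes out farther away.
    print(f"car: {car:.1f} m, person: {person:.1f} m")
```

The estimate is only as good as the size prior, which is the point being made: the prior comes from learned knowledge about cars and people, not from the sensor.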

> Lidar _also_ doesn't know what the glass feels like.

Yes I agree with you, lidar and most current vision sensors also suffer from this.

People who have good vision in one eye can usually get their driver's licence without problems. So depth from stereo is not a necessary part of driving for humans.
It doesn't matter how you estimate depth, but you do have to estimate it to drive, and the first step before you can estimate is that your eyes (eye in your example) need to see pictures. Light entering the eye is an entirely different stage in the process than reasoning about said light.
Others have commented about the human aspect.

> Two cameras of any resolution spaced a regular distance apart should be able to build a better parallax 3D model than any one camera alone.

This is true if the platform isn't moving.

If you have the time dimension and you have good knowledge of motion between frames (difficult), you can use the two views as a virtual stereo pair. This is called monocular visual/inertial-SLAM. You can supplement with GPS, 2D lidar, odometry and IMU to probabilistically fuse everything together. There have been some nice results published over the years.

But in general yes, you'll always be better off if you have a proper stereo pair with a camera either side of the car.

> humans can build near perfect 3D representations of the world

The idea that the human brain has a "near perfect" 3D representation of one's surroundings seems inaccurate to me. There's a difference between near perfection and good enough that people don't often get hurt, when all of their surroundings are deliberately constructed to limit exposure to danger.

I write code for industrial equipment and often get the request to fix a problem with software. The question "Can a computer do X" is too easy to answer in the affirmative - "Yes, but less accurately and only most of the time, and with a lot of time and money" gets condensed to "Yes" quickly.

And it is indeed an impressive and heroic piece of work when you can fix sensor problems with clever filtering, or fix mechanical problems with clever control algorithms. But when designing new equipment or deciding a path to fix a bad design, you never want to hamstring yourself from the start with poor quality input data and output actuators. That approach only leads to pain.

Once you have lots of experience with a particular design - dozens of similar machines running successfully in production for years - then you can start looking for ways to be clever and improve performance over the default or save a little money.

I understand Elon's desire to get lots of data. But there will be a much greater chance of success if it starts with Lidar + cameras, and a decade down the road you can work on camera-only controls and compare what they calculated and would have done to what the Lidar measured and how the car actually responded. Only when these are sufficiently close should you phase out the Lidar.

Remember, you're comparing bad input data going to the best neural net known in the universe (the human brain), with millennia of evolution and decades of training data, to sensor inputs going to brand new programming. Help out the computer with better input data.

For human level driving, a human level understanding of the scene from purely visual information is quite good enough. The first problem, though, is that the human brain has far more processing power than any computer that can fit in a car and probably more than any single computer yet constructed (estimating even to a single order of magnitude is hard). We're also leveraging millions of years of evolution, though I'm not entirely sure how much of a difference that makes given how different our ancestral environment was from driving a car.

The other thing is that we, ideally, want a computer to drive a car better than a human can. There's a lot to be gained from having precise rather than approximate notions of other objects' distances and speeds in terms of driving both safely and efficiently. Now, Tesla has also got that radar, which when fused with visual data will help somewhat, but I'm not sure how far that can get them.

Yes, we can. We can do it with one eye too.

but it takes at least 10 years to train.

But most of the time we are not building a 3d map from points. we are building it from object inference.

There are many advantages that we have over machines:

o The eye sees much better in the dark
o It has a massive dynamic range, allowing us to see both light and dark things
o it moves to where the threat is
o if it's occluded it can move to get a better image
o it has a massive database of objects in context
o each object has a mass, dimension, speed and location it should be seen in

None of those are 3d maps, they are all inference, where one can derive the threat/advantage based on history.

We can't make machines do that yet.

you are correct that two cameras allows for better 3d pointcloud making in some situations. but a moving single camera is better than a static multiview camera.

however even then the 3d map isn't all that great, and has a massive latency compared to lidar.

I think most of our ability to judge relative distance is based on our brain's judgement of lighting, texture, inference, and sound. While having two eyes helps a lot, you can still navigate a complex office environment with one eye closed. It just takes a bit more care.
When I was younger I remember hearing about how we can do all these things because we have 2 eyes. And that depth perception is what gives us the ability to not walk into walls, and do other things including driving.

I have thought about this many times and often wondered why when closing one eye I am still able to function.

Since then I have strongly suspected that depth perception is used for training some other part of our brain, and afterwards only serves to increase the accuracy of our perception of reality.

Further proof of this is TV. Even on varying sized screens humans tend to do well figuring out the actual size of things displayed.

Take one class on perception, read one textbook, and you'll immediately find that stereo perception isn't very important. Your brain uses a host of depth cues, and stereo vision is just one of them.

Some of them translate trivially to photos/TV/etc, like convergent lines or texture gradient. Some of them are surprisingly physical, like feedback from your eyes about vergence or focal distance.

Stereo is highly effective up close, say within 10 meters (yards). And it works faster than many modes. It's absolutely fantastic for catching things out of the air. Given our interocular distance, it's basically garbage past, I dunno, 30m or something? (obviously it degrades smoothly across distance)
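
One way to see the smooth falloff is to look at the vergence angle, the angle between the two lines of sight to a point straight ahead, and how little it changes per meter once the target is far away. This is a geometric sketch only. The 0.065 m eye separation is an assumed typical adult value, and the code says nothing about where perception actually stops being useful.

```python
import math


def vergence_angle_arcmin(distance_m, eye_separation_m=0.065):
    """Angle between the two lines of sight to a point straight ahead, in arcminutes."""
    angle_rad = 2.0 * math.atan(eye_separation_m / (2.0 * distance_m))
    return math.degrees(angle_rad) * 60.0


def vergence_change_arcmin(distance_m, step_m, eye_separation_m=0.065):
    """How much the vergence angle drops when the target moves step_m farther away."""
    return (vergence_angle_arcmin(distance_m, eye_separation_m)
            - vergence_angle_arcmin(distance_m + step_m, eye_separation_m))


if __name__ == "__main__":
    for z in (1.0, 10.0, 30.0):
        angle = vergence_angle_arcmin(z)
        change = vergence_change_arcmin(z, 1.0)
        print(f"{z:5.1f} m: angle {angle:7.2f}', change over next 1 m {change:6.2f}'")
```

With the assumed separation, moving a target from 1 m to 2 m changes the angle by over a hundred arcminutes, while moving it from 30 m to 31 m changes it by a fraction of one arcminute. That is the same signal getting weaker, not a hard cutoff.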

I've heard more than one academic (evolutionary cognitive psychologists, etc) speculate that the single biggest evolutionary advantage of having two eyes is to have a spare in the event of damage. That might well be just whimsy and exaggeration, but I think it puts a helpful alternate perspective on it (pun!).

I'm skeptical of the claim that a major reason for having two eyes is depth perception.

One reason why you're still able to function is that you don't rely on your sense of depth that much these days, i.e. you don't need to gauge where a spear or arrow will land. Even in a car, you are effectively on a one dimensional track and only have to decide to go left or right.

If you only had one eye, then in situations where there is lots of pressure to perceive depth I think you'd have to move your head around a lot.

Which makes me wonder, which human activities demand the best depth perception?

See my sibling comment: with respect to stereo vision, its greatest strength is nearby fast-moving things, great for stuff like dodging or catching or punching.

If you wanna launch spears or arrows, depth perception is incredibly important, but stereo vision will not help. Not with this interocular distance, anyway.

Humans can determine the size of objects because we look for references in the scene and we understand the context.

If a person is standing next to a bush then we roughly know their height since we know the range of sizes that a bush could grow to. Likewise the size of someone like Thanos from Avengers would look odd in a documentary, but because it's a superhero movie we assume that's normal.

Self driving cars to my knowledge do none of this.

Stereo depth perception is not that important. People born without it end up being able to navigate pretty well, dodge walls, climb stairs, etc. It just takes practice.
Fun trick: Look at a photograph with one eye closed. Your brain will do ... something and the picture will look 3d
That would be the same trick it does when you look at it with both eyes open...
About 10 years ago I went to an eye doctor with a small object in my eye, and she had to cover it after removing the small object.

Driving back home with 1 eye was scary even though I was going much slower. It is possible to drive with 1 eye, but much much harder than with 2 eyes.

Did you drive any further than just the way home? I would bet most people would adapt quite quickly.
No, I wasn’t experimenting, but I haven’t had any car accidents in my life, and I find that more valuable
In these modern times yes, there's little selective pressure keeping depth perception sharp. That doesn't mean most of our ability to judge depth is from monocular cues (though that could be true).

https://en.wikipedia.org/wiki/Depth_perception#Theories_of_e...

There are also depth cues from https://en.wikipedia.org/wiki/Vergence#Convergence, right? As in focusing on the object itself?
Wikipedia lists 18 different types of depth cues that humans use!

https://en.wikipedia.org/wiki/Depth_perception

This seems like a bit of a double-edged sword. On the one hand, it means there's more than one way to achieve a 3D model of the world with cameras. On the other hand, it means that if what machines can do with cameras is going to match what we humans can do with our eyes, they will need to either advance along 18 different fronts or take some of those cues further than we can.

The most rudimentary life forms are little factories that build themselves. I think we should concentrate on making cars that build themselves and maybe then our technology will be sophisticated enough to consider looking into giving our cars human-like optical processing faculties.

Otherwise we'll just have to figure out how to build autonomous vehicles with the technology we have, which is pretty crappy in comparison to biology in a lot of ways still.

When a tree falls over a river, it creates a rudimentary bridge, as has happened for longer than humans have existed. Yet, while we can create huge suspension bridges from steel, we can't create wood.
This is getting into grey goo territory.
You cannot have false negatives. Ever. You cannot have a situation where the system doesn't see a pedestrian and runs over them without noticing. So you need to make a very convincing argument that it can't happen.

With cameras and computer vision there's no way to prove it. There is always a chance that it will glitch out for a second and kill someone.

Autonomous vehicles don't need to be perfect drivers -- far from it, they just need to be better than humans.
No, we accept humans as being imperfect but we do not accept machines as being imperfect. Yes this means that companies that have autonomous vehicles that have a lower accident rate than humans may still be completely unable to sell them because of legal issues and market perception.

We don’t know yet what the acceptance rate is for autonomous accidents - but I can guarantee it’s not the rational value of 1:1 or “as safe as humans”. They’ll need to do a lot better.

Yes and there's also a kind of lying with statistics that goes on. That human accident rate includes drunk drivers, very young drivers, very old drivers, etc.

The average accident rate is not your expected accident rate, if you are an average person who is not in those categories.
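
The arithmetic behind that point is just a weighted average. In the sketch below the group shares and relative rates are invented purely for illustration and are not real accident statistics.

```python
def population_rate(groups):
    """Exposure-weighted average rate over (share_of_driving, rate) pairs."""
    total_share = sum(share for share, _ in groups)
    return sum(share * rate for share, rate in groups) / total_share


if __name__ == "__main__":
    # Hypothetical numbers: 90% of driving is done by a "typical" group with
    # relative rate 1.0, 10% by a high-risk group with relative rate 10.0.
    typical = (0.90, 1.0)
    high_risk = (0.10, 10.0)

    average = population_rate([typical, high_risk])
    print(f"population average: {average:.2f}, typical driver: {typical[1]:.2f}")
```

With these made-up inputs the population average comes out near twice the typical driver's rate, so a system that merely beats the average could still be worse than the typical driver.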

A million people die on the roads each year, about 40,000 of those in the USA.

If what you say is true then a future where robot cars kill 500,000 per year and 20,000 in the USA would be considered acceptable.

Yet we know this is absolutely not the case, no society will ever stand for such a massive death toll due to robot usage. Are there any industries today where robots are allowed to kill so many?

We accept deaths because of human failing as there is no other way, the alternative is no cars.

So for us to hand over the reins to robots they need to be near perfect, think the accident rates of the airline industry as the only acceptable goal.

> near perfect 3D representations of the world with 2D images

This is ridiculous.

I am sitting in front of a monitor right now. Please explain how I can perfectly determine the depth of it even though I can't see behind it? I can move my head all around it to capture hundreds of different viewpoints but a car can't do that.

Nobody made a rule that says cars can't have cameras in more than one location.
When moving, cars can compare hundreds of different viewpoints. Multiple cameras provide for depth perception when stationary.
it’s too bad that cars can’t move to get additional points of view.
Not like a human can, actually. Fixed cameras are fixed relative to the bodywork, necks are not. OTOH, if I move my head it's usually to get around a blind spot, something cameras have less of an issue with.
Cameras do not perform saccades, for starters... The hardware isn't as analogous as it might seem.