by vamin 3220 days ago
I disagree with the analogy to speech recognition. What makes speech recognition difficult is that it's highly dependent on context (sounds the same, but means something different depending on surrounding words, what the conversation is about, or even who you're speaking with). With driving, you should be able to make a good decision with just the instantaneous state, given enough information about that state (objects, velocities, etc). You can argue about what constitutes "enough information," but it seems plausible to me that given enough sensors we could meet or exceed the amount of information taken in by a pair of human eyes on a swivel.

I'm skeptical you can do real autonomous driving without context. In DC, we have roads that flow one direction part of the day and another direction the other part. We've got roads that will be shut down unpredictably when there is a diplomatic event. We've got constant construction, where a two lane road might be reduced to one lane with a human holding a sign or using hand signals to usher cars through on their turns. How does a self-driving car handle that without understanding context?
For a few more very common examples of humans using context:

A human will spot a person wearing headphones and recognize that the person has low situational awareness. Even if the AI were perfect, the computer doesn't come close to having the optical resolution to do that: remember that human vision is 570+ megapixels, and even a 4K video stream is literally two orders of magnitude lower.

[Now think about the fact that if we built a camera capable of recording 400 megapixels, you'd currently need to schlep around a ~750 lb, 25-node cluster, consuming about 50 horsepower to feed it with electricity, just to be able to process the video stream at 25 fps. Moore's law ain't growing that fast these days, so matching the resolution of human vision is not a realistic option.]

Another example is kids. How does the AI recognize that the 5'1" 30-year-old woman has much better awareness and can be treated differently from the 5'2" 12-year-old boy? Humans can spot that difference even from behind.

How about recognizing an adult who is drunk? Or a blind person? Mourners at a funeral, or fans celebrating after a football game? Or a million other conditions that significantly affect pedestrian situational awareness that human drivers will instantly infer from context?

What will happen when kids figure out they can stop a driverless car on its way to collect its owner just by standing in the street in front of it? They'll have a lot of fun, for sure.

How about when carjackers figure out the same? That they can dress up like construction workers, stop the car in the street, tow it onto a flatbed with built-in RF jammer and head straight for their underground chop shop? There goes your cheaper insurance.

> remember human vision is 570+ megapixels

This seems to come from http://www.clarkvision.com/articles/eye-resolution.html

But that number is a calculation of the maximum resolving power of the human eye filled across a 120 degree field of view. The fovea is the only portion of the retina that actually attains that acuity and it encompasses roughly 2 degrees in the center of the retina.

There are roughly 120 million rod cells and 6 million cone cells in the retina. The cone cells handle color vision and the rod cells handle low light. As each individual cone cell is primarily sensitive to one of red, green or blue, the cones match fairly well to the RGB channels of a pixel. So the eye could be considered to provide data roughly equivalent to a 2 megapixel color image plus a 120 megapixel grayscale image. In raw sample count, that is ~5 times a 4K image.
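
The "~5 times a 4K image" figure can be sanity-checked by counting raw samples. This is only a rough sketch: it assumes a 3840x2160 frame with three color samples per pixel, and it treats each photoreceptor as one sample. Those assumptions are mine, not from the linked article.

```python
# Rough sanity check: photoreceptor count vs. raw samples in one 4K frame.
# Receptor counts are the round figures quoted in the comment above.
ROD_CELLS = 120_000_000
CONE_CELLS = 6_000_000

# Assumed frame format (not from the thread): UHD 3840x2160, 3 channels.
UHD_WIDTH, UHD_HEIGHT = 3840, 2160
CHANNELS_PER_PIXEL = 3


def samples_per_4k_frame() -> int:
    """Number of individual color samples in one assumed 4K frame."""
    return UHD_WIDTH * UHD_HEIGHT * CHANNELS_PER_PIXEL


def photoreceptor_ratio() -> float:
    """Total photoreceptors divided by samples in one 4K frame."""
    return (ROD_CELLS + CONE_CELLS) / samples_per_4k_frame()


if __name__ == "__main__":
    print(f"samples per 4K frame: {samples_per_4k_frame():,}")
    print(f"photoreceptors / 4K samples: {photoreceptor_ratio():.2f}")
```

Under these assumptions the ratio comes out at roughly 5, which matches the estimate in the comment.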

Edit: And even that overestimates the amount of data the brain is actually processing. A 4K 60 fps video stream takes about 6 Gbps, while the human optic nerve only has roughly 8.75 Mbps of bandwidth.
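
For what it's worth, the 6 Gbps figure is consistent with uncompressed 4K at 60 fps if you assume 8-bit 4:2:0 chroma subsampling (12 bits per pixel on average). That assumption is my guess at where the number comes from. The optic nerve figure is simply the one quoted above. A minimal sketch of the arithmetic:

```python
# Back-of-the-envelope bitrate comparison.
# OPTIC_NERVE_BPS is the figure quoted in the comment, not a measured value here.
OPTIC_NERVE_BPS = 8.75e6


def raw_video_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Uncompressed video bitrate in bits per second."""
    return width * height * fps * bits_per_pixel


# Assumption: 8-bit 4:2:0 averages 12 bits per pixel; 8-bit 4:4:4 uses 24.
UHD_60_420_BPS = raw_video_bitrate_bps(3840, 2160, 60, 12)
UHD_60_444_BPS = raw_video_bitrate_bps(3840, 2160, 60, 24)


def ratio_to_optic_nerve(bitrate_bps: float) -> float:
    """How many times larger a bitrate is than the quoted optic nerve bandwidth."""
    return bitrate_bps / OPTIC_NERVE_BPS


if __name__ == "__main__":
    print(f"4K60 4:2:0: {UHD_60_420_BPS / 1e9:.2f} Gbps")
    print(f"4K60 4:4:4: {UHD_60_444_BPS / 1e9:.2f} Gbps")
    print(f"4:2:0 vs optic nerve: {ratio_to_optic_nerve(UHD_60_420_BPS):.0f}x")
```

On these numbers the raw video stream is several hundred times the quoted optic nerve bandwidth.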

A lot of processing is done "on the road", starting from ganglia in the retina itself; so a direct comparison with 4k 60fps video is grossly incorrect.
The computer could just assume the worst case:

That all people are classified as drunk children wearing headphones with low situational awareness.

If the self-driving proponents were to say that human drivers would be banned, and all vehicles would therefore be self-driving, I'd have more respect for their arguments. That said, you still need to account for things like pedestrians and snow. If we're talking about "self-driving on the I-5 when it isn't raining and there are no human drivers permitted" then yes, I think we're probably close.
For a self driving car, it's sufficient if its understanding of weird situations is limited to detecting "yep, I'm in a weird situation, I'm going to stop now and wait for a remote human to take control".

The situations you describe are rare - I've once had a diplomatic event that required weird rerouting and twice had cases where traffic was regulated by hand signals due to some crash on the road, but that means just a few cases over a whole lifetime. A system that can't solve these cases but recognizes them as unsolvable is a quite acceptable automated system if it can delegate control to a human inside or a remote dispatcher, which isn't that hard to do.

These situations aren't rare at all. Just in the last few months, I've had: change in left-turn traffic pattern leaving my office onto a major road; humans guiding traffic on 2-3 separate occasions (very common around July 4th); cones dramatically changing lane patterns in construction zones; two occasions of cops blocking off a road with their cars to let a motorcade pass.

And this is just me driving (i.e. my car is parked 90% of the day). If you're talking about a self-driving Uber in D.C., one of the above events will happen on a daily basis.

Your mention of humans guiding traffic reminds me of advice from my father many years ago: never assume that a human giving you direction when you're driving is giving you good information. Always evaluate whether what they're conveying to you may be either misunderstood (e.g. what does "wave" mean??) or just plain false. Hard for a human to do.
I imagine traffic in D.C. is incredibly atypical compared to most US cities. Issues common to drivers in D.C. are probably very rare to most drivers in the US, including those in other major metropolitan areas.
I live in New England and the poster's description of DC traffic sounds like what I see all the time, from cities like Boston or even small towns like East Longmeadow.

http://www.masslive.com/news/index.ssf/2017/05/east_longmead...

We actually now understand that passing control to a human is seriously dangerous - unless you can do it a minute or so in advance, the human has to switch context from being a passive passenger (or, more likely, actively engrossed in something else) into being an active driver very quickly. Everything up to level 4 automation will cause accidents when the car attempts to hand control to a human.

Of course, there's a very reasonable argument that e.g. level 3 automation might cause fewer accidents overall, even if it kills people when it has no idea what to do, but convincing Joe Public that such a car with such a known flaw is safe is another matter.

>"yep, I'm in a weird situation, I'm going to stop now and wait for a remote human to take control".

No, that's not sufficient.

If ten people did that in a critical area during a high demand hour it would be a news story and there would be criminal charges depending on the details.

If you redefine "sufficient" to include stopping your car on the George Washington bridge because it's confused by a construction zone it still doesn't solve the backup you cause.

The exceptional cases can often be the most important, especially when you are talking about moving humans from place to place. I wonder how a self-driving car would react to something like driving in a hurricane?
But you, as a human, don't need to know what happened before to be able to drive through those conditions. You just need the immediately-available information, which is what OP defined as "no need for context" vs speech recognition.
How does a self driving car differentiate between a police officer flagging you down and a carjacker? Humans can make this judgement because we have a context for each of those situations based on our understanding of how the world works outside of just operating a vehicle safely.
To be fair, humans can't do this either, or there wouldn't be such a thing as carjackers.

One of the biggest challenges that automated systems face is that the acceptable failure rate for them is far below the acceptable failure rate for humans in the same role. To err is human...

Yes, but even humans are much better drivers on roads they are familiar with.
It would not be too costly to augment roads with electronic signalling devices which give information to the software in cars. This information can do what signboards or traffic signals do for humans.

The difficult part - when there is an error in these signals, or things shut down, autonomous cars will suffer much bigger problems than human driven cars.

It's these edge cases that are the problem. We already rely on such mechanisms for planes (information comes from both ground control and on-board radar). But a lot of care and resources are required to get to n nines of reliability.

> It would not be too costly to augment roads with electronic signalling devices which give information to the software in cars.

Sounds costly to me. There are a lot more roads than airport runways. And then the big question is: Who is going to pay for it?

Remote driving assistance by a retired Uber driver on basic income is a possible low-tech solution.
By context you mean a timeline of events while speech recognition is actually dependent on that timeline. Same applies to situational state recognition in driving. E.g., to derive speed, acceleration and direction of objects.
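
To make the point about needing a timeline concrete, here is a minimal sketch of deriving speed, heading and acceleration from successive position fixes using finite differences. The `Observation` type and `kinematics` function are hypothetical names for illustration. A real perception stack would use filtered tracks rather than three raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    """One timestamped position fix for a tracked object (hypothetical type)."""
    t: float  # seconds
    x: float  # metres
    y: float  # metres


def velocity(a: Observation, b: Observation) -> tuple[float, float]:
    """Average velocity (m/s) between two observations."""
    dt = b.t - a.t
    if dt <= 0:
        raise ValueError("observations must be in increasing time order")
    return ((b.x - a.x) / dt, (b.y - a.y) / dt)


def kinematics(track: list[Observation]) -> tuple[float, float, tuple[float, float]]:
    """Return (speed m/s, heading degrees, acceleration m/s^2) from the last 3 fixes."""
    if len(track) < 3:
        raise ValueError("need at least three observations")
    o0, o1, o2 = track[-3:]
    v1 = velocity(o0, o1)
    v2 = velocity(o1, o2)
    # The two velocities apply at the midpoints of their intervals,
    # so the time between them is half the total span.
    dt = (o2.t - o0.t) / 2
    accel = ((v2[0] - v1[0]) / dt, (v2[1] - v1[1]) / dt)
    speed = math.hypot(v2[0], v2[1])
    heading = math.degrees(math.atan2(v2[1], v2[0]))
    return speed, heading, accel


if __name__ == "__main__":
    track = [Observation(0.0, 0.0, 0.0), Observation(1.0, 1.0, 0.0), Observation(2.0, 3.0, 0.0)]
    print(kinematics(track))
```

A single instantaneous frame gives position only. Velocity needs at least two frames and acceleration at least three, which is the sense in which state estimation depends on a short history.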
No, in this context (pun intended) of speech recognition, the context means external context, i.e., understanding lots of information about the topic of that speech, knowing what the speaker might plausibly be trying to say, what real world entities might be involved, and what they are called and how they are spelled - all kinds of information that is not included in the original audio data, things that the listener would know based on life experience.