by m3at 2494 days ago
> It’s always better to have multiple sensor modalities available.

This is the main takeaway. Unsurprising but interesting nonetheless. I'm working in the field and it confirms my experience.

However, they have a big bias that needs to be pointed out:

> [...] we must be able to annotate this data at extremely high accuracy levels or the perception system’s performance will begin to regress.

> Since Scale has a suite of data labeling products built for AV developers, [...]

Garbage in, garbage out; yes, annotation quality matters. But they're neglecting very promising approaches that make it possible to leverage non-annotated datasets (typically standard RGB images) to train models, for example self-supervised learning from video. A great demonstration of the usefulness of self-supervision is monocular depth estimation: taking consecutive frames (2D images), we can estimate per-pixel depth and camera ego-motion by training to warp previous frames into future ones. The result is a model capable of predicting depth on individual 2D frames. See this paper [1][2] for example.

By using this kind of approach, we can lower the need for precisely annotated data.
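To make the warping idea concrete, here is a minimal NumPy sketch of the geometric core, not the method from the linked paper. It assumes a pinhole camera with known intrinsics `K` and a relative pose (`R`, `t`) from the target frame to the source frame; the function names `reproject` and `photometric_loss` are made up for illustration. In a real system the depth and pose come from networks, and the sampling is bilinear so the loss stays differentiable; nearest-neighbour sampling is used here only to keep the example short.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map each target-frame pixel to (u, v) coordinates in the source frame.

    depth: (H, W) depth predicted for the target frame (positive values)
    K:     (3, 3) pinhole intrinsics (assumed known)
    R, t:  rotation (3, 3) and translation (3,) taking target-camera
           coordinates to source-camera coordinates
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Back-project pixels to 3D points using the predicted depth
    points = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the source camera and project them again
    proj = K @ (R @ points + t.reshape(3, 1))
    uv = proj[:2] / proj[2:3]
    return uv.T.reshape(H, W, 2)

def photometric_loss(target, source, depth, K, R, t):
    """Mean absolute difference between the target and the warped source."""
    H, W = depth.shape
    uv = reproject(depth, K, R, t)
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    # Ignore pixels that land outside the source image
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    warped = source[v.clip(0, H - 1), u.clip(0, W - 1)]
    return np.abs(warped - target)[valid].mean()
```

The supervision signal is just this loss: if the predicted depth and ego-motion are right, the warped source frame matches the target frame, so no depth labels are needed.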

[1] https://arxiv.org/abs/1904.04998

[2] more readable on mobile: https://www.arxiv-vanity.com/papers/1904.04998/

Edit: typo


> taking consecutive frames (2D images) we can estimate per pixel depth

Yeah, I find it odd that they're bringing up Elon's statement about LiDAR, but then completely ignore that they spoke about creating 3D models based on video. They even showed [0] how good a 3D model they could create based on data from their cameras. So they could just as well annotate in 3D.

0: https://youtu.be/Ucp0TTmvqOE?t=8217

Egomotion is very useful but relies on being able to reliably extract features from objects, which isn't always possible. Smooth, monochromatic walls do exist and it's imperative a car be able to avoid them. It is possible for a human to figure out (almost always) their shape and distance from visual cues, but our brains are throwing far more computational horsepower at the task than even Tesla's new computer has available. But perhaps knowing when it doesn't know is sufficient for their purposes, and probably an easier task.

An interesting intermediate case between a pure video system and a lidar is a structured light sensor like the Kinect. In those you project a pattern of features onto an object in infrared. It doesn't work so well in sunlight, but I'd be interested in learning whether anyone has ever tried to use that approach with egomotion.

"Smooth, monochromatic walls do exist and it's imperative a car be able to avoid them."

Aren't those the types of walls, barriers, and truck behinds that Teslas keep ramming into? :S

Maybe I missed it, I only watched part of that 4-hour video, but why don't they do as humans do and geometrically construct a Z-buffer representation from 2 or more cameras?

Then you'd get all that sweet, sweet depth data that lidar provides but cheaper and at a much higher resolution.

That was briefly touched on in the article:

> One approach that has been discussed recently is to create a pointcloud using stereo cameras (similar to how our eyes use parallax to judge distance). So far this hasn’t proved to be a great alternative since you would need unrealistically high-resolution cameras to measure objects at any significant distance.

Doing some very rough math, assuming a pair of 4K cameras with a 50 degree FOV on opposite sides of the vehicle (for maximum stereo separation), and assuming you could perfectly align the pixels from both cameras, it seems you could theoretically measure depth with a precision of +/-75 cm for an object 70 meters away (a typical braking distance at highway speeds). In practice, I imagine most of the difficulty is in matching up the pixels from both cameras precisely enough.
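For anyone who wants to redo that estimate, here is a small sketch using the standard stereo relation z = f * B / d, which gives a depth error of roughly z^2 / (f * B) per pixel of disparity error. The comment above doesn't state a baseline or an exact pixel count, so the 1.6 m baseline, the 3840 px horizontal resolution, and the 1 px matching error below are my assumptions, and `stereo_depth_error` is just an illustrative name.

```python
import math

def stereo_depth_error(distance_m, baseline_m, h_res_px, hfov_deg,
                       disparity_err_px=1.0):
    """Approximate depth uncertainty of a rectified stereo pair.

    Uses dz ~= z^2 / (f * B) * dd, where f is the focal length in pixels
    derived from the horizontal resolution and field of view.
    """
    focal_px = (h_res_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return distance_m ** 2 * disparity_err_px / (focal_px * baseline_m)

# Assumed values: 4K-wide sensor, 50 degree FOV, cameras 1.6 m apart
err = stereo_depth_error(distance_m=70, baseline_m=1.6,
                         h_res_px=3840, hfov_deg=50)
print(f"depth error at 70 m: about {err:.2f} m")
```

With those assumed numbers the result is about 0.74 m, in the same ballpark as the +/-75 cm figure. Note that the error grows with the square of the distance, so it gets worse quickly beyond 70 m, and sub-pixel matching would shrink it proportionally.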