| "It’s always better to have multiple sensor modalities available." This is the main takeaway. Unsurprising, but interesting nonetheless. I work in the field and it confirms my experience. However, they have a big bias that needs to be pointed out: "[...] we must be able to annotate this data at extremely high accuracy levels or the perception system’s performance will begin to regress. Since Scale has a suite of data labeling products built for AV developers, [...]" Garbage in, garbage out; yes, annotation quality matters. But they're neglecting very promising approaches that make it possible to train models on non-annotated datasets (typically standard RGB images), for example self-supervised learning from video. A great demonstration of the usefulness of self-supervision is monocular depth estimation: taking consecutive frames (2D images), we can estimate per-pixel depth and camera ego-motion by training the model to warp previous frames into future ones. The result is a model capable of predicting depth from individual 2D frames. See this paper [1][2] for example. With this kind of approach, we can lower the need for precisely annotated data. [1] https://arxiv.org/abs/1904.04998 [2] more readable on mobile: https://www.arxiv-vanity.com/papers/1904.04998/ Edit: typo |
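To make the warping idea concrete, here is a rough NumPy sketch of the photometric reprojection loss that this kind of self-supervision relies on. It is only an illustration, not the linked paper's implementation: the function names (`reproject`, `warp_nearest`, `photometric_loss`), the pinhole intrinsics `K`, and the nearest-neighbour sampling are all assumptions made for brevity. Real systems predict depth and pose with networks and use differentiable bilinear sampling so the loss can be backpropagated.

```python
import numpy as np

def reproject(depth, K, R, t):
    """For every target pixel, compute its (u, v) position in the source frame."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D points
    src = K @ (R @ cam + t.reshape(3, 1))                  # move to source camera, project
    uv = src[:2] / src[2:3]                                # perspective divide
    return uv.reshape(2, h, w)

def warp_nearest(source, uv):
    """Sample source at uv (nearest neighbour); also return the in-view mask."""
    h, w = source.shape
    u = np.rint(uv[0]).astype(int)
    v = np.rint(uv[1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    warped = np.zeros_like(source)
    warped[valid] = source[v[valid], u[valid]]
    return warped, valid

def photometric_loss(target, source, depth, K, R, t):
    """Mean absolute difference between the target and the warped source."""
    warped, valid = warp_nearest(source, reproject(depth, K, R, t))
    return np.abs(warped[valid] - target[valid]).mean()

# Toy scene (hypothetical values): flat wall 5 units away, camera moves 0.5 sideways.
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.5, 0.0, 0.0])
source = np.arange(64, dtype=float).reshape(8, 8)
target = np.zeros_like(source)
target[:, :-1] = source[:, 1:]  # focal * baseline / depth = 10 * 0.5 / 5 = 1 pixel shift

loss_right_depth = photometric_loss(target, source, np.full((8, 8), 5.0), K, R, t)
loss_wrong_depth = photometric_loss(target, source, np.full((8, 8), 2.5), K, R, t)
```

In this toy scene the loss is lower for the correct depth than for the wrong one, and that difference is the whole training signal: no depth labels are involved, only the two frames.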
Yeah, I find it odd that they bring up Elon's statement about LiDAR but then completely ignore that they also spoke about creating 3D models from video. They even showed [0] how good a 3D model they could create from their camera data. So they could just as well annotate in 3D.
[0] https://youtu.be/Ucp0TTmvqOE?t=8217