by soulofmischief
291 days ago
Two of them, giving us stereo vision. We are provided visual cues that encode depth. The ideal world model would at least have this. A world model for a video game on a monitor might be able to get away with no depth information, but a) conventional game engines already have this information, and it would make sense to provide as much data to a general model as possible, and b) models trained without it wouldn't work for AR/VR. Training on stereo captures seems like a win all around.
|
None of these world models has an explicit concept of depth or 3D structure, and adding one would go against the principle of the Bitter Lesson. Even with two stereo captures, there is no explicit 3D structure; depth is only implicit in the differences between the two views.
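
To make "encoded but not explicit" concrete, here is a minimal sketch of the textbook relation for a rectified pinhole stereo pair, Z = f * B / d. The function name and the rig numbers (700 px focal length, 6.5 cm baseline) are made up for illustration and are not taken from any of the models discussed here. The point is only that a stereo pair carries disparity, and something still has to do this conversion, whether hand-written or learned.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters from pixel disparity, via Z = f * B / d.

    Assumes an idealized, rectified pinhole stereo rig (hypothetical setup).
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 700 px focal length, 0.065 m baseline (assumed values).
    for d in (70.0, 35.0, 7.0):
        z = depth_from_disparity(d, focal_px=700.0, baseline_m=0.065)
        print(f"disparity {d:5.1f} px -> depth {z:.2f} m")
```

Halving the disparity doubles the estimated depth, so distant objects differ by only a few pixels between the two views, which is roughly why the depth signal in stereo captures gets weak at range.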