|
|
|
|
|
by radarsat1
1237 days ago
|
|
> trained only on Text-Image pairs and unlabeled videos This is fascinating. It's able to pick up sufficiently on the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics. |
|