Hacker News new | ask | show | jobs
by radarsat1 1237 days ago
> trained only on Text-Image pairs and unlabeled videos

This is fascinating. It's able to pick up sufficiently on the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics.