by bonoboTP
260 days ago
This is not the final target. It's video generation now, but that's just a stepping stone. The real thing is that learning a generator is also learning a prior over videos, and hence over how the world works. The real application of this will be world models, vision-language-action models, spatial AI and robotics. Basically a kind of learned simulator in which to plan and imagine possible futures, possible actions, affordances etc. Video models could become a spatial reasoning platform too. A recent paper by DeepMind (using Veo 3) showed that video models can perform many high-level vision tasks out of the box. I don't think it's going to end here at some slop feed.
|
The final target of these "world models" on a 20-year horizon is entirely unmanned factories taking over the economy, and swarms of drones and robots fighting wars and policing citizens.
This is why hundreds of billions are being poured into these things. Cute Ghibli-style videos and vacuum robots wouldn't be worth this much money otherwise.