by godelski
173 days ago
As a machine learning researcher, I don't get why these are called world models. Visually, they are stunning. But it's nowhere near physical. I mean, look at that video with the girl and lion. The tail teleports between legs and then becomes attached to the girl instead of the lion. Just because the visuals are high quality doesn't mean it's a world model or has learned physics. I feel like we're conflating these things. I'm much happier to call something a world model if its visual quality is dogshit but it is consistent with its world. And I say its world because it doesn't need to be consistent with ours.
With this kind of image gen, you can sorta plan robot interactions, but it's super slow. I need to find the paper that DeepMind produced, but basically they took the current camera input, used a text prompt like "robot arm picks up the ball", had the video model generate the arm motion, and then the robot arm moved as it did in the video.
The problem is that it's not really a world model, it's just image gen. It's not like the model outputs a simulation that you can interact with (without generating more video). It's not like it creates a bunch of rough geometry that you can then run physics on (i.e. you imagine a setup, draw it out, and then run calcs on it).
There is lots of work on making splats editable and semantically labeled, but again, that's not something you can run physics on, so simulation is still very expensive. Also, the properties depend on running the "world model" rather than querying the output at a point in time.
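
To make the "simulation you can interact with" idea concrete, here is a toy sketch in Python. It is purely an illustration and not taken from the DeepMind paper or any real system: `BallState`, `step`, and `rollout` are hypothetical names, and the physics is a crude 1D ball under gravity. The point is only that the world is an explicit state plus a transition function, so you can apply actions and read off properties at any time step without rendering a single frame.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2, assumed constant for this toy world


@dataclass(frozen=True)
class BallState:
    height: float    # metres above the ground
    velocity: float  # metres per second, positive is up


def step(state: BallState, push: float, dt: float) -> BallState:
    """Advance the toy world by dt seconds.

    `push` is an upward acceleration chosen by the agent (the "action").
    """
    velocity = state.velocity + (GRAVITY + push) * dt
    height = state.height + velocity * dt
    if height < 0.0:
        # Crude ground contact: clamp to the floor and stop moving.
        height, velocity = 0.0, 0.0
    return BallState(height, velocity)


def rollout(state: BallState, pushes: list[float], dt: float = 0.01) -> list[BallState]:
    """Apply a sequence of actions and keep every intermediate state."""
    states = [state]
    for push in pushes:
        state = step(state, push, dt)
        states.append(state)
    return states


# Drop the ball from 1 m with no pushing, simulated for 1 second.
trajectory = rollout(BallState(height=1.0, velocity=0.0), [0.0] * 100)

# Query the world at an arbitrary point in time; no video generation involved.
print(trajectory[20])
```

The visuals here are nonexistent, but the world is consistent with its own rules: the same state and actions always give the same trajectory, and changing an action mid-rollout just means calling `step` with a different `push`.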