Hacker News
by causal 21 days ago
I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

6 comments

No, the "action" part is the distinction. Their world model is conditioned on robot actions, for example, which gives you two things video gen alone can't: predicting the future frames that follow a given action (change the action, get a different future from the same starting frame), and running it in reverse to infer the actions behind observed frames or output the actions needed to hit a goal (the output is motor commands and not video frames).
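To make the forward/inverse distinction concrete, here's a toy sketch in Python. It is not their model or API: `Frame`, `predict_next`, `infer_action`, and `plan` are made-up names, and the "world" is a one-dimensional track standing in for video frames. A real world model would learn these mappings from data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Toy 'observation': the robot's position on a 1-D track (hypothetical)."""
    position: int


# Hypothetical action vocabulary: each action shifts the position by a fixed amount.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def predict_next(frame: Frame, action: str) -> Frame:
    """Forward direction: same starting frame, different action, different future."""
    return Frame(frame.position + ACTIONS[action])


def infer_action(before: Frame, after: Frame) -> str:
    """Inverse direction: find the action that explains an observed transition."""
    for action in ACTIONS:
        if predict_next(before, action) == after:
            return action
    raise ValueError("no single action explains this transition")


def plan(start: Frame, goal: Frame, max_steps: int = 10) -> list[str]:
    """Goal-directed use: the output is a list of actions, not frames.

    Greedy and bounded by max_steps, so it may return a partial plan
    if the goal is further away than max_steps.
    """
    actions: list[str] = []
    frame = start
    for _ in range(max_steps):
        if frame == goal:
            break
        action = "right" if goal.position > frame.position else "left"
        actions.append(action)
        frame = predict_next(frame, action)
    return actions


if __name__ == "__main__":
    start = Frame(0)
    print(predict_next(start, "left"), predict_next(start, "right"))
    print(infer_action(Frame(2), Frame(3)))
    print(plan(start, Frame(3)))
```

Plain video gen only covers something like `predict_next` without the `action` argument; the other two functions are what the "action" half of the description adds.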
As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples is purely analysing an existing video; the other predicts a video from a static image (i.e. video gen).
Look at the table of supported modalities. It can take image/video/text/actions as input and output image/video/text/actions.
That just raises more questions. What kind of "observation or action" does image input generate? What is an action output if it's not text?
If I were to hallucinate what it is and why it's worded that way: the AI robot space is in need of a hyper-realistic game engine with better physics than Unity/Unreal-style non-deformable rigid body mechanics, one that's also way faster than 1x, completely unlike engineering FEM sims, and this caters to that need.
It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots to people's homes.
You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.