by protortyp 13 days ago
No, the "action" part is the distinction. Their world model is conditioned on robot actions, for example, which gives you two things video generation alone can't: it can predict the future frames that follow a given action (change the action and you get a different future from the same starting frame), and it can run in reverse to infer the actions behind observed frames or to output the actions needed to hit a goal (the output is motor commands, not video frames).
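
To make the two directions concrete, here is a toy sketch. It is not their model: it assumes a 2-D point "robot" whose state is just its position and whose action is a velocity command, and every name in it (`forward_model`, `inverse_model`, `action_to_goal`, `DT`) is made up for illustration. A real system would learn these mappings over video frames or latents rather than hard-code them.

```python
import numpy as np

DT = 0.1  # hypothetical timestep, chosen only for this toy example


def forward_model(state, action):
    """Forward direction: given a state and an action, predict the next state."""
    return state + DT * action


def inverse_model(state, next_state):
    """Reverse direction: given two observed states, infer the action between them."""
    return (next_state - state) / DT


def action_to_goal(state, goal, max_speed=1.0):
    """Goal-conditioned use: output the action (not a frame) that moves toward the goal."""
    desired = (goal - state) / DT
    speed = np.linalg.norm(desired)
    if speed > max_speed:
        # Clip to the fastest command the toy robot is allowed to issue.
        desired = desired * (max_speed / speed)
    return desired


start = np.array([0.0, 0.0])

# Same starting state, different actions, different futures.
future_right = forward_model(start, np.array([1.0, 0.0]))
future_up = forward_model(start, np.array([0.0, 1.0]))

# Recover the action that explains an observed transition.
recovered = inverse_model(start, future_right)

# Ask for a motor command that heads toward a goal position.
command = action_to_goal(start, np.array([1.0, 0.0]))
```

A plain video generator would correspond to a `forward_model` with no `action` argument at all, so there would be nothing to vary and nothing to invert.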