No, the "action" part is the distinction. Their world model is conditioned on actions (robot actions, for example), which gives you two things video gen alone can't: it can predict the future frames that follow a given action (change the action, get a different future from the same starting frame), and it can run in reverse to infer the actions behind observed frames or output the actions needed to hit a goal (the output is motor commands, not video frames).
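To make the forward/reverse distinction concrete, here's a toy sketch in Python. It is not their model or API: the "frame" is just a 2D point, the "action" is a displacement, and every name in it (`predict_next`, `infer_action`, `plan_to_goal`) is made up for illustration. The real thing would be learned networks over video frames, but the three roles are the same.

```python
from typing import List, Tuple

State = Tuple[float, float]   # toy stand-in for a video frame (assumption)
Action = Tuple[float, float]  # toy stand-in for a motor command (assumption)


def predict_next(state: State, action: Action) -> State:
    """Forward direction: same starting state, different action, different future."""
    return (state[0] + action[0], state[1] + action[1])


def infer_action(state: State, next_state: State) -> Action:
    """Reverse direction: recover the action that explains an observed transition."""
    return (next_state[0] - state[0], next_state[1] - state[1])


def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def plan_to_goal(state: State, goal: State,
                 max_step: float = 1.0, max_steps: int = 100) -> List[Action]:
    """Goal-conditioned use: the output is a list of actions, not frames."""
    actions: List[Action] = []
    for _ in range(max_steps):
        dx, dy = infer_action(state, goal)
        if abs(dx) < 1e-9 and abs(dy) < 1e-9:
            break  # goal reached
        action = (clamp(dx, max_step), clamp(dy, max_step))
        actions.append(action)
        # Roll the forward model to see where this action leads.
        state = predict_next(state, action)
    return actions


if __name__ == "__main__":
    start = (0.0, 0.0)
    print(predict_next(start, (1.0, 0.0)))   # one action, one future
    print(predict_next(start, (0.0, 1.0)))   # different action, different future
    print(infer_action(start, (1.0, 0.0)))   # action behind an observed transition
    print(plan_to_goal(start, (3.0, -2.0)))  # actions needed to hit a goal
```

A plain video generator would only cover something like `predict_next` with no action argument, so you couldn't steer the future or read motor commands back out.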
As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples purely analyses an existing video; the other predicts a video from a static image (i.e. video gen).
If I were to hallucinate what it is and why it's worded that way: the AI robot space needs a hyper-realistic game engine with better physics than Unity/Unreal-style non-deformable rigid-body mechanics, one that's also way faster than 1x real time, completely unlike engineering FEM sims, and this caters to that need.
It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots into people's homes.