| HN Mirror


by causal 21 days ago

I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

6 comments

protortyp 21 days ago

No, the "action" part is the distinction. Their world model is conditioned on actions (robot actions, for example), which gives you two things video gen alone can't: it can predict the future frames that follow a given action (change the action, get a different future from the same starting frame), and it can run in reverse to infer the actions behind observed frames or output the actions needed to hit a goal (the output is motor commands, not video frames).
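To make the two directions concrete, here is a toy sketch. It is not their model: a 2D position stands in for a video frame, a displacement stands in for a motor command, and the "world model" is plain addition rather than a learned network. All names (`predict_next`, `rollout`, `infer_actions`, `plan_to_goal`) are made up for illustration.

```python
from typing import List, Tuple

# Toy stand-ins (assumptions, not the real model's data types):
State = Tuple[float, float]   # "observation": a 2D position instead of a video frame
Action = Tuple[float, float]  # "action": a displacement instead of motor commands


def predict_next(state: State, action: Action) -> State:
    """Forward direction: next observation given the current one and an action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(start: State, actions: List[Action]) -> List[State]:
    """Predict the sequence of observations that follows a sequence of actions."""
    states = [start]
    for action in actions:
        states.append(predict_next(states[-1], action))
    return states


def infer_actions(observations: List[State]) -> List[Action]:
    """Inverse direction: recover the actions that explain consecutive observations."""
    return [
        (nxt[0] - cur[0], nxt[1] - cur[1])
        for cur, nxt in zip(observations, observations[1:])
    ]


def plan_to_goal(start: State, goal: State, steps: int) -> List[Action]:
    """Goal-conditioned direction: output actions (not frames) that reach the goal."""
    step = ((goal[0] - start[0]) / steps, (goal[1] - start[1]) / steps)
    return [step] * steps


if __name__ == "__main__":
    start = (0.0, 0.0)
    # Same starting observation, different actions, different futures.
    print(rollout(start, [(1.0, 0.0), (1.0, 0.0)]))
    print(rollout(start, [(0.0, 1.0), (0.0, 1.0)]))
    # Run it in reverse: observations in, actions out.
    print(infer_actions([(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]))
    # Ask for a goal and get actions back.
    print(plan_to_goal(start, (4.0, 2.0), steps=2))
```

A plain video generator would only cover something like `rollout` with no `actions` argument. The action input and the action outputs are the parts that matter for robotics.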

swiftcoder 21 days ago

As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples is purely analysing an existing video; the other is predicting (i.e. video gen) a video from a static image.

derac 21 days ago

Look at the table of supported modalities. It can take image/video/text/actions as input and output image/video/text/actions.

causal 21 days ago

That just raises more questions. What kind of "observation or action" image does the input generate? What is an action output if it's not text?

numpad0 21 days ago

If I were to hallucinate what it is and why it's worded that way: the AI robot space is in need of a hyper-realistic game engine with better physics than Unity/Unreal-style non-deformable rigid-body mechanics, one that's also way faster than 1x, completely unlike engineering FEM sims, and this caters to that need.

heliosAtwork 21 days ago

It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots to people's homes.

ainch 21 days ago

You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.
