| HN Mirror


	by polygamous_bat 849 days ago
	Firstly, do these models learn a good physics grounding for nonsense actions? For example, what happens if you keep pressing down when you are already on the ground? Will they phase you through the ground? Secondly, why are all the videos about half a second long? I thought video generation had come much further than this. My guess would be that the world models unravel at any length longer than that, which is (and has always been) the problem with models such as these. Minus the video generation part, we already had pretty good world models for games; see the Dreamer line of work: https://danijar.com/project/dreamerv3/

jparkerholder 849 days ago

Author here :) Re: 1) Typically no, but of course it can hallucinate, just like LLMs. 2) Agreed, but the key point missing is that Dreamer is trained from an RL environment with action labels. Genie is trained exclusively from videos and learns an action space. This is the first version of something that is now possible and will only improve with scale.

polygamous_bat 849 days ago

Thanks for braving the crowd here, you will unfortunately only find hard questions.

Anyway, about my second question: why are the videos only about half a second long? Does the model unravel after that?

Also

> This is the first version of something that is now possible and will only improve with scale.

11B params is already pretty large compared with the scale of Stable Diffusion and LLMs. How much further do we need to scale before we get something useful beyond simple setups?

jparkerholder 849 days ago

The bigger issue is a failure to generate novel content rather than a total "unravel". We focus on OOD images because our motivation is generating diverse environments, but these are much harder to play for longer than images closer to the training videos. It is interesting because one of the things you gain when going from 1B->10B is the OOD images working at all. Note it is not even trivial to detect the character given our model does not train with any labels or have any inductive biases to do so.

Point of clarification -- we don't expect bigger models to be the only way to improve this and are working on innovations on the modeling side. However, we don't want to overlook the significance of scaling either :)

YeGoblynQueenne 846 days ago

>> Note it is not even trivial to detect the character given our model does not train with any labels or have any inductive biases to do so.

Why not add inductive biases then and make your life easier? What's with this choice to try and do everything the hard way, presumably to make a point? In the end the point made is so specific that it translates to nothing that is usable in real problems.

See MuZero, for example: sure, you can learn without being given the rules explicitly, just from the win/loss signal, but then that only works in board games and Atari games, without a snowball's chance in hell that it will work in the real world. We're dazzled by the technical prowess, but real utility? Where is that?
