| I struggle with these world models from the perspective of video games (so this post is a particular perspective). I'm not a game developer myself, but some of my favorite games carry a deep sense of intentionality. For instance, there is typically not a single item misplaced in a FromSoftware game (or, more recently, in Lies of P). Almost every object is placed intentionally. Games that lack this intentionality often feel dead in contrast. You run into moments that break immersion, or that pull you out of the experience the developer is trying to convey. It's difficult for me to imagine world models getting to a place where this sort of intentionality is captured. The best frontier LLMs fail to do this in writing (all the time), and even in code, and the surface of experience in those mediums often feels "smaller" than the user interaction profile of a video game. It's also not clear how these world models could be used modularly by humans hoping to develop intentional experiences. I don't know much about how they're used (LLMs are somewhat modular: they can produce text, humans can work on it, other LLMs can work on it). Is the same true for the video output here? All this to say, I'm impressed with these world models, but, as with LLMs and writing, it's not really clear what we are building towards. Are we just able to create less satisfying, less humane experiences faster? Perhaps the most immediate benefit is the ability for robotic systems to simulate actions (by conjuring a world and imagining the implications). In general, I have the feeling that we are hurtling towards a world with less intentionality behind all the things we experience. Everything becomes more impersonal, noisier, etc. |
Making a world internally consistent through explicit placement gets harder as the world grows in scale. When internal consistency is a factor in quality, there is a scale beyond which generated content eventually becomes the higher-quality solution.
Secondly, when generating content with AI, the same rules around carelessness apply. There are certainly generative AI tools out there that offer few options for composing what you want, but that is not an inherent part of AI. Some of it is because people want rudimentary interfaces. Some of it is that the generators are new enough that their control mechanisms are limited, because the focus has been on doing something at all before doing it in a highly controlled way. In some ways, the problem is that the technology is new enough that it can be hard to describe what desirable controllability would even look like. Building the generator first to see what people would like it to be able to do is, I think, a reasonable path to follow before creating the controls that people want. Part of it is also that there _are_ tools that give a high level of control over what is generated, but far fewer people get to see them. There are ways to control style, object placement, camera motion, scene composition, etc. The more specialised you get, the smaller the subset of people who need that specific control.
I think AI can make things possible for people who could not have made them without it, but it's still going to take care to make something special.