by vladimirralev 355 days ago
He is not using appropriate models for this conclusion, nor is he using state-of-the-art models in this research; moreover, he doesn't have an expensive foundation model to build upon for 2D games. It's just a fun project.

A serious attempt at video/vision would involve some probabilistic latent space that can be noised in ways that make sense for games in general. I think veo3 proves that ai can generalize 2d and even 3d games, generating a video under prompt constraints is basically playing a game. I think you could prompt veo3 to play any game for a few seconds and it will generally make sense even though it is not fine tuned.

7 comments

Veo3's world model is still pretty limited. That becomes obvious very fast once you prompt for out-of-distribution video content (i.e. stuff that you are unlikely to find on YouTube). It's extremely good at creating photorealistic surfaces and lighting. It even has a reasonably solid understanding of fluid dynamics for simulating water. But for complex human behaviour (in particular certain motions) it simply lacks the training data. That's not really a fault of the model, though, and I'm pretty sure there will be a way to overcome this as well. Maybe some kind of physics-based simulation as supplementary training data.
What is the basis for it having a reasonable understanding of fluid dynamics? Why don’t you think it’s just regurgitating some water scenes derived from its training data, rather than generating actual fluid dynamics?
Because it can actually extrapolate to unseen cases while maintaining realism.
Ah yes, the classic “because it can” argument. I’ll take that to mean you don’t know what you’re talking about.
It seems you are confusing this with a personal opinion. This is not my opinion. This is merely the consensus of current research.

See here for example:

[1] https://arxiv.org/pdf/2410.18072

[2] https://arxiv.org/pdf/2411.02914v1

[3] https://openai.com/index/video-generation-models-as-world-si...

But even if you knew nothing about this topic, the observation that you simply couldn't store the necessary amount of video data in a model such that it could simply regurgitate it should give you a big clue as to what is happening.

Is any model currently known to succeed in the scenario where Carmack’s inappropriate model failed?
No monolithic models, but using hybrid approaches we've been able to beat humans for some time now.
To confirm: hybrid approaches can demonstrate competence at newly-created video games within a short period of exposure, so long as similar game mechanics from other games were incorporated into their training set?
What you're thinking of is much more like the Genie model from DeepMind [0]. That one is like Veo, but interactive (though not publicly available).

[0] https://deepmind.google/discover/blog/genie-2-a-large-scale-...

> I think veo3 proves that ai can generalize 2d and even 3d games, generating a video under prompt constraints is basically playing a game.

In the same way that keeping a dream journal is basically doing investigative journalism, or talking to yourself is equivalent to making new friends, maybe.

The difference is that while they may both produce similar, "plausible" output, one does so as a result of processes that exist in relation to an external reality.

> I think veo3 proves that ai can generalize 2d and even 3d games

It doesn't. And you said it yourself:

> generating a video under prompt constraints is basically playing a game.

No. It's neither generating a game (that people can play) nor is it playing a game (it's generating a video).

Since it's not a model of the world in any sense of the word, there are issues with even the most basic object permanence. E.g. here's veo3 generating a GTA-style video. Oh look, the car spins 360 and ends up on a completely different street than the one it was driving down previously: https://www.youtube.com/watch?v=ja2PVllZcsI

It is still doing a great job for a few frames, and you could keep it more anchored to the state of the game if you prompt it, much like you can prompt coding agents to keep a log of all decisions previously made. Permanence is excellent; it slips often, but mostly because it is not grounded to a specific game state by the prompt or by a decision log.
So "it generates a game" has somehow become "it's incapable of maintaining basic persistence without continuous prompting per frame".

Also, prompting doesn't work as you imply it does.

> generating a video under prompt constraints is basically playing a game

Besides static puzzles (like a maze or jigsaw), I don't believe this analogy holds. A model working with prompt constraints that aren't evolving or being added to over the course of "navigating" the generation of its output needs to process zero new information that it didn't come up with itself. Playing a game is different from other generation because it's primarily about reacting to input whose precise timing and spatial details you didn't know in advance, but which you can learn falls within a known set of higher-order rules. Obviously, the more finite, deterministic, or predictably probabilistic the video game's solution space, the more it can be inferred from the initial state (i.e. it reduces to the same type of problem as generating a video from a prompt), which is why models are still able to play video games. But as GP pointed out, transfer is negative in such cases: the overarching rules are not predictable enough across disparate genres.

> I think you could prompt veo3 to play any game for a few seconds

I'm curious what your threshold for what constitutes "play any game" is in this claim? If I wrote a script that maps button combinations to average pixel color of a portion of the screen buffer, by what metric(s) would veo3 be "playing" the game more or better than that script "for a few seconds"?

edit: removing knee-jerk reaction language
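
For concreteness, the kind of baseline script described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `BUTTONS` list, the 2x2 top-left patch, and the brightness bucketing are hypothetical and not taken from any real game or emulator API. It reads the average color of a screen region and maps it to a button combination, with no notion of the game's rules at all.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]        # (R, G, B), each 0-255
Frame = Sequence[Sequence[Pixel]]   # frame[row][col]

# Hypothetical button combinations; a real game would define its own.
BUTTONS: List[Tuple[str, ...]] = [(), ("LEFT",), ("RIGHT",), ("JUMP",)]


def average_color(frame: Frame, top: int, left: int,
                  height: int, width: int) -> Pixel:
    """Mean RGB over a rectangular region (assumed to lie inside the frame)."""
    totals = [0, 0, 0]
    for row in frame[top:top + height]:
        for pixel in row[left:left + width]:
            for channel in range(3):
                totals[channel] += pixel[channel]
    count = height * width
    return (totals[0] // count, totals[1] // count, totals[2] // count)


def choose_buttons(frame: Frame) -> Tuple[str, ...]:
    """Pick a button combination from the brightness of the top-left 2x2 patch."""
    r, g, b = average_color(frame, top=0, left=0, height=2, width=2)
    brightness = (r + g + b) // 3               # 0..255
    index = brightness * len(BUTTONS) // 256    # bucket into len(BUTTONS) ranges
    return BUTTONS[index]
```

A script like this emits inputs every frame, which is the point of the question: any metric for "playing" would have to separate a model from this sort of rule-free mapping.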

It's not ideal, but you can prompt it with an image of a game frame, explain the objects and physics in text and let it generate a few frames of gameplay as a substitute for controller input as well as what it expects as an outcome. I am not talking about real interactive gameplay.

I am just saying we have proof that it can understand complex worlds and sets of rules, and then abide by them. It doesn't know how to use a controller and it doesn't know how to explore the game physics on its own, but those steps are much easier to implement based on how coding agents are able to iterate and explore solutions.

I think we need a spatial/physics model handling movement and tactics watched over by a high level strategy model (maybe an LLM).
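
As a rough illustration of that split, here is a toy sketch on a 2D grid. It is purely hypothetical: `StrategyModel` stands in for the high-level model (an LLM in the suggestion above) and just hands out waypoints, while `MovementModel` stands in for the spatial/physics model and takes one grid step at a time. The names, the grid world, and the `replan_every` parameter are all assumptions for the example, not an existing system.

```python
from typing import Iterable, List, Tuple

Position = Tuple[int, int]


class StrategyModel:
    """High-level planner (stand-in for e.g. an LLM): picks the next waypoint."""

    def __init__(self, waypoints: Iterable[Position]) -> None:
        self.waypoints: List[Position] = list(waypoints)

    def next_goal(self, position: Position) -> Position:
        # Drop waypoints already reached, then return the next one (or stay put).
        while self.waypoints and self.waypoints[0] == position:
            self.waypoints.pop(0)
        return self.waypoints[0] if self.waypoints else position


class MovementModel:
    """Low-level controller (stand-in for a spatial/physics model)."""

    def step(self, position: Position, goal: Position) -> Position:
        # Move one grid cell toward the goal, x axis first, then y.
        x, y = position
        gx, gy = goal
        if x != gx:
            x += 1 if gx > x else -1
        elif y != gy:
            y += 1 if gy > y else -1
        return (x, y)


def run(start: Position, strategy: StrategyModel, mover: MovementModel,
        max_steps: int = 100, replan_every: int = 3) -> List[Position]:
    """The mover acts every step; the strategy model is consulted only
    every `replan_every` steps or when the current goal has been reached."""
    position, goal, path = start, start, [start]
    for t in range(max_steps):
        if t % replan_every == 0 or position == goal:
            goal = strategy.next_goal(position)
        if position == goal:
            break  # nothing left to do
        position = mover.step(position, goal)
        path.append(position)
    return path
```

The design choice being illustrated is only the division of labor: the expensive high-level model is called rarely, and the cheap low-level one handles every frame.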