by D-Machine 137 days ago
Let's be more precise: LLMs have to model the world from an intermediate tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice, tokens mostly tokenize natural language.

LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).

LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.

So, do "LLMs model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But patterns encoded in the hidden state are still patterns of tokens.

Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.


I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?

But I know very little about this.

You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models. This is the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in abstract world-model space.
> You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models.

This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.

E.g. if you have a VLM with a frozen dual-backbone architecture (say, a vision transformer encoder trained on images, and an LLM encoder backbone pre-trained in the usual LLM way), then even if, for example, you design this architecture so the embedding vectors produced by each encoder have the same shape, to be combined via another component, e.g. some unified transformer, it will not be the case that e.g. the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
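
A toy sketch of that point, under loud assumptions: the two "encoders" here are hypothetical stand-ins (independent random linear maps over the same underlying concept vectors), not a real VLM or any library API. Their outputs have the same shape, yet cosine similarity across the two spaces hovers around zero even for the same concept, while similarity inside one space tracks how close the inputs are.

```python
import math
import random


def make_encoder(rng, in_dim, out_dim):
    """Hypothetical stand-in for a frozen backbone: a fixed random linear map."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(concept):
        return [sum(w * c for w, c in zip(row, concept)) for row in weights]

    return encode


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_spaces(seed=0, n_concepts=200, concept_dim=8, embed_dim=64):
    """Return (mean cross-encoder cosine, mean within-encoder cosine)."""
    rng = random.Random(seed)
    image_encoder = make_encoder(rng, concept_dim, embed_dim)  # assumed "vision" side
    text_encoder = make_encoder(rng, concept_dim, embed_dim)   # assumed "text" side

    cross, within = [], []
    for _ in range(n_concepts):
        concept = [rng.gauss(0.0, 1.0) for _ in range(concept_dim)]
        nearby = [c + rng.gauss(0.0, 0.05) for c in concept]
        # Same concept, two independently built spaces.
        cross.append(cosine(image_encoder(concept), text_encoder(concept)))
        # Slightly different concepts, one shared space.
        within.append(cosine(image_encoder(concept), image_encoder(nearby)))
    return sum(cross) / n_concepts, sum(within) / n_concepts


if __name__ == "__main__":
    mean_cross, mean_within = compare_spaces()
    print(f"same concept, different encoders: {mean_cross:+.3f}")
    print(f"nearby concepts, same encoder:    {mean_within:+.3f}")
```

Real backbones are nonlinear and trained rather than random, so this only illustrates why matching shapes alone do not make two embedding spaces comparable; making them comparable takes an explicit alignment step trained on paired data.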

They do not model the world.

They present a statistical model of an existing corpus of text.

If this existing corpus includes useful information it can regurgitate that.

It cannot, however, synthesize new facts by combining information from this corpus.

The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human-produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order for it to be applied to any scenario; your typical corpus is also biased towards internet discussions, the English language, and western prejudices.

If we focus on base models and ignore the tuning steps after that, then LLMs are "just" a token predictor. But we know that pure statistical models aren't very good at this. After all we tried for decades to get Markov chains to generate text, and it always became a mess after a couple of words. If you tried to come up with the best way to actually predict the next token, a world model seems like an incredibly strong component. If you know what the sentence so far means, and how it relates to the world, human perception of the world and human knowledge, that makes guessing the next word/token much more reliable than just looking at statistical distributions.
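
For reference, a minimal sketch of the kind of Markov chain text generator mentioned above, assuming a first-order (bigram) chain over whitespace-separated words; the tiny corpus and function names are made up for illustration. Each next word depends only on the previous word, which is why longer outputs tend to lose the thread.

```python
import random
from collections import defaultdict


def build_bigram_table(tokens):
    """Map each token to the list of tokens observed right after it."""
    table = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        table[current].append(following)
    return dict(table)


def generate(table, start, length, seed=0):
    """Sample a sequence where each token depends only on the previous one."""
    rng = random.Random(seed)
    output = [start]
    # Stop early if the last token was never followed by anything in the corpus.
    while len(output) < length and output[-1] in table:
        output.append(rng.choice(table[output[-1]]))
    return output


# Illustrative toy corpus, not real training data.
corpus = "the cat sat on the mat and the dog sat on the cat".split()
table = build_bigram_table(corpus)
sample = generate(table, "the", 12)
print(" ".join(sample))
```

Every adjacent pair in the output was seen in the corpus, so it is locally plausible, but nothing in the table tracks what the sentence as a whole is about.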

The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seem to be able to navigate the mistakes, contradictions and propaganda with some success, but fail at things like spatial awareness. That's why OpenAI is pushing image models and 3D world models, despite making very little money from them: they are working towards LLMs with more complete world models, unchained by language.

I'm not sure if they are on the right track, but from a theoretical point of view I don't see an inherent fault.

There are plenty of faults in this idea.

First, the subjectivity of language.

1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.

2) When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no differently from any "world model" information.

A world model should be as objective as possible. Language, the most subjective form of information, is a bad fit.

The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.

> People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has

Which companies try to address with image, video and 3D world capabilities, to add that missing context. "Video generation as world simulators" is what OpenAI once called it.

> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.

Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do

> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute

The argument is that training neural networks with gradient descent is a universal optimizer. It will always try to find weights for the neural network that cause it to produce the "best" results on your training data, within the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture that is capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved with a solution that involves building a world model, and a neural network that is capable of encoding this, and gradient descent will eventually create a world model.
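
A deliberately tiny sketch of the "basic math" case, assuming the simplest possible setup: a two-weight linear model, mean squared error, and full-batch gradient descent on examples of addition. All names and hyperparameters are illustrative choices, not anything from a real training stack.

```python
import random


def train_adder(steps=500, lr=0.1, n_samples=100, seed=0):
    """Fit y = w1*x1 + w2*x2 to examples of x1 + x2 with gradient descent."""
    rng = random.Random(seed)
    data = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_samples)]
    w1, w2 = 0.0, 0.0
    for _ in range(steps):
        g1 = g2 = 0.0
        for x1, x2 in data:
            error = (w1 * x1 + w2 * x2) - (x1 + x2)
            # Gradient of the mean squared error with respect to each weight.
            g1 += 2 * error * x1 / n_samples
            g2 += 2 * error * x2 / n_samples
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2


if __name__ == "__main__":
    w1, w2 = train_adder()
    print(f"learned weights: w1={w1:.4f}, w2={w2:.4f}")
```

Here the data is best explained by both weights being 1, and the architecture can express that, so the optimizer lands there. Whether the same story scales from addition to a world model is exactly what the thread is arguing about.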

Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that apply to your training data but don't generalize. Just because it should be possible in principle doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment it seems to have captivated investors.
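
A small sketch of such a shortcut, under assumed toy conditions: the true target is `y = x1`, but in the training data a second feature always happens to copy `x1`. Gradient descent from zero splits the credit between the two features, which fits training data perfectly and breaks as soon as the second feature stops tracking the first. Names and numbers are illustrative.

```python
import random


def train_with_spurious_feature(steps=500, lr=0.1, n_samples=100, seed=0):
    """Target is y = x1, but during training x2 is always an exact copy of x1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n_samples)]
    w1, w2 = 0.0, 0.0
    for _ in range(steps):
        g1 = g2 = 0.0
        for x in xs:
            error = (w1 * x + w2 * x) - x  # x2 == x1 on every training example
            g1 += 2 * error * x / n_samples
            g2 += 2 * error * x / n_samples
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2


def squared_error(w1, w2, x1, x2):
    """Error against the true rule y = x1 for a single input."""
    return (w1 * x1 + w2 * x2 - x1) ** 2


if __name__ == "__main__":
    w1, w2 = train_with_spurious_feature()
    print(f"weights: w1={w1:.3f}, w2={w2:.3f}")
    print("error when x2 copies x1:  ", squared_error(w1, w2, 1.0, 1.0))
    print("error when x2 disagrees:  ", squared_error(w1, w2, 1.0, -1.0))
```

Nothing in the training loss distinguishes the intended solution from the shortcut, so the optimizer has no reason to prefer the one that generalizes.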

You did not address the second issue at all. You are inverting the implication in your argument. Whether gradient descent helps solve the language model problem or not does not help you show that this means it's a useful world model.

Let me illustrate the point using a different argument with the same structure: 1) The best professional chefs are excellent at cutting onions 2) Therefore, if we train a model to cut onions using gradient descent, that model will be a very good professional chef

2) clearly does not follow from 1)
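
The logical shape of that objection can be checked mechanically. A minimal sketch, reading A as "is an excellent chef" and B as "is excellent at cutting onions" (the encoding is an assumption made for illustration): it enumerates truth assignments where "A implies B" holds but the converse "B implies A" fails.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def counterexamples_to_converse():
    """Assignments (a, b) where 'a implies b' holds but 'b implies a' fails."""
    return [
        (a, b)
        for a, b in product([False, True], repeat=2)
        if implies(a, b) and not implies(b, a)
    ]


if __name__ == "__main__":
    for a, b in counterexamples_to_converse():
        print(f"excellent chef={a}, excellent at onions={b}")
```

The one surviving assignment is the onion-cutting model: B is true while A is false, which is fully compatible with premise 1).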

I think the commenter is saying that they will combine a world model with the word model. The resulting combination may be sufficient for very solid results.

Note that humans generate their own incomplete world model. For example, there are sounds and colors we don't hear or see, odors we don't smell, etc. We have an incomplete model of the world, but we still have a model that proves useful for us.

> It cannot, however, synthesize new facts by combining information from this corpus.

That would be like saying studying mathematics can't lead to someone discovering new things in mathematics.

Nothing would ever be "novel" if studying the existing knowledge could not lead to novel solutions.

GPT 5.2 Thinking is solving Erdős Problems that had no prior solution - with a proof.

The Erdős problem was solved by interacting with a formal proof tool, and the problem was trivial. I also don't recall whether this was the problem someone had already solved earlier but not reported, but that does not matter.

The point is that the LLM did not model maths to do this; it made calls to a formal proof tool that did model maths, and was essentially working as the step function of a search algorithm, iterating until it found the zero in the function.

That's clever use of the LLM as a component in a search algorithm, but the secret sauce here is not the LLM but the middleware that operated both the LLM and the formal proof tool.

That middleware was the search tool that a human used to find the solution.
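
The structure being described can be sketched as a propose-and-verify loop. This is a hypothetical toy, not a description of the actual system behind the Erdős result: `propose` is a random guesser standing in for the LLM, `verify` is an exact divisibility check standing in for the formal proof tool, and `search` plays the role of the middleware.

```python
import random


def propose(rng, n):
    """Stand-in for the LLM: emits a guess with no guarantee it is right."""
    return rng.randint(2, n - 1)


def verify(n, candidate):
    """Stand-in for the formal tool: an exact check that cannot be talked round."""
    return n % candidate == 0


def search(n, max_steps=2000, seed=0):
    """Middleware loop: request proposals until one passes verification."""
    rng = random.Random(seed)
    for step in range(1, max_steps + 1):
        candidate = propose(rng, n)
        if verify(n, candidate):
            return candidate, step
    return None, max_steps


if __name__ == "__main__":
    divisor, steps = search(91)
    print(f"verified divisor of 91: {divisor} (after {steps} proposals)")
```

In this toy, all of the correctness comes from `verify`; a better proposer would only reduce the number of steps. How much of the credit belongs to the proposer in the real case is the point under dispute.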

This is not the same as a synthesis of information from the corpus of text.

  It cannot, however, synthesize new facts by combining information from this corpus.
Are we sure? Why can't the LLM use tools, run experiments, and create new facts like humans?
Then the LLM is not actually modelling the world, but using other tools that do.

The LLM is not the main component in such a system.

So do we expect real world models to just regurgitate new facts from their training data?
Regurgitating facts kind of assumes it is a language model, as you're assuming a language interface. I would assume a real "world model" or digital twin to be able to reliably model relationships between phenomena in whatever context is being modeled. Validation would probably require experts in whatever thing is being modeled to confirm that the model captures phenomena to some standard of fidelity. Not sure if that's regurgitating facts to you -- it isn't to me.

But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.

  But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.
You said this:

  If this existing corpus includes useful information it can regurgitate that. It cannot, however, synthesize new facts by combining information from this corpus.
So I'm wondering if you think world models can synthesize new facts.
They do model the world. Watch Nobel Prize winner Hinton, or let's admit that this is more of a religious question than a technical one.
They model the part of the world that (linguistic models of the world posted on the internet) try to model. But what is posted on the internet is not IRL. So, to be glib: LLMs trained on the internet do not model IRL, they model talking about IRL.
His point is that human language and the written record is a model of the world, so if you train an LLM you're training a model of a model of the world.

That sounds highly technical if you ask me. People complain if you recompress music or images with lossy codecs, but when an LLM does that suddenly it's religious?

A model of a model of X is a model of X, albeit extra lossy.
An LLM has an internal linguistic model (i.e. it knows token patterns), and that linguistic model models humans' linguistic models (a stream of tokens) of their actual world models (which involve far, far more than linguistics and tokens, such as logical relations beyond mere semantic relations, sensory representations like imagery and sounds, and, yes, words and concepts).

So LLMs are linguistic (token pattern) models of linguistic models (streams of tokens) describing world models (more than tokens).

It thus does not in fact follow that LLMs model the world (as they are missing everything that is not encoded in linguistic semantics).

At this point, anyone claiming that LLMs are "just" language models isn't arguing in good faith. LLMs are a general-purpose computing paradigm. LLMs are circuit builders: the converged parameters define pathways through the architecture that pick out specific programs. Or as Karpathy puts it, LLMs are a differentiable computer[1]. Training LLMs discovers programs that reproduce the input sequence well. Tokens can represent anything, not just words. Roughly the same architecture can generate passable images, music, or even video.

[1] https://x.com/karpathy/status/1582807367988654081

In this case this is not so. The primary model is not a model at all, and the surrogate has bias added to it. It's also missing any way to actually check the internal consistency of statements or otherwise combine information from its corpus, so it fails as a world model.