| HN Mirror



	by corimaith 292 days ago
	The definition of a language model is literally the probability distribution of the most likely next token given a preceding text. When OP says "memorizing patterns and repeating stuff", it's a strawman of a basic n-gram model. Obviously modern language models are more advanced because we have techniques like vector tokenization, but at its core it's still just probability that's limited to the corpus it was trained on. Put another way, if you give it a question it has never seen, it works out the most likely reply you might get and gives you that. But that doesn't mean there is an internal world-model or anything. Ultimately it comes down to whether you think language is sufficient to model reality, which I think it probably is not. The output would obviously be very convincing, but not necessarily correct.
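	For reference, the "basic n-gram model" being contrasted here can be sketched in a few lines. This is only an illustrative bigram (n=2) toy, not anything a real LLM does; the function names `train_bigram` and `next_token_distribution` and the tiny corpus are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return P(next | prev) as a dict; empty if prev never appeared."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in followers.items()}


# Hypothetical toy corpus, whitespace-tokenized.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)

dist = next_token_distribution(model, "the")     # "cat" twice, "mat" once
unseen = next_token_distribution(model, "dog")   # context not in the corpus
```

	A model like this has nothing to say about a context it never saw (`unseen` is empty), which is the sense in which a plain n-gram model is limited to its corpus.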

1 comments

nearbuy 292 days ago

This isn't true at all. LLMs absolutely do world-model, and researchers have shown this many times on smaller language models.

> techniques like vector tokenization

(I assume you're talking about the input embedding.) This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net. This is very different from an n-gram model, and it is probably capable of figuring out anything a human can figure out, given sufficient scale and the right weights. We don't have that yet in practice, but it's not due to a theoretical limitation of ANNs.

> probability distribution of the most likely next token given a preceding text.

What you're talking about is an autoregressive model. That's more of an implementation detail. There are other kinds of LLMs.

I think talking about how it's just predicting the next token is misleading. It's implying it's not reasoning, not world-modeling, or is somehow limited. Reasoning is predicting, and predicting well requires world-modeling.

corimaith 290 days ago

>This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net.

What separates transformers from LSTMs is their ability to process the entire corpus in parallel rather than in sequence, and the inclusion of the more efficient "attention" mechanism that allows them to pick up long-range dependencies across a language. We don't actually understand the full nature of the latter, but I suspect that it is the basis behind the more "intelligent" actions of the LLM. There's quite a general range of problems that long-range dependencies can encompass, but that's still ultimately limited by language itself.

But if you're talking about this being fundamentally a probability distribution model, I stand by that, because that's literally the mathematical model (softmax for the encoder and decoder) that's being used in transformers here. It very much is generating a probability distribution over the vocabulary and just picking the highest-probability token (or using beam search) as your next output.
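To make the "softmax, then pick the highest probability" step concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the logit values are invented for illustration; in a real transformer the logits would come out of the network's final layer, and this shows only greedy selection, not beam search.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(vocab, logits):
    """Return the highest-probability token and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


# Hypothetical vocabulary and scores, not taken from any real model.
vocab = ["cat", "dog", "mat"]
logits = [2.0, 0.5, 1.0]

token, probs = greedy_pick(vocab, logits)
```

Here `token` is the entry with the largest logit, since softmax preserves ordering. Sampling from `probs` instead of taking the maximum would give a different decoding strategy over the same distribution.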

>LLMs absolutely do world-model, and researchers have shown this many times on smaller language models.

We don't have a formal semantic definition of a "world model". I would take a lot of what these researchers are writing with a grain of salt, because something like that crosses more into philosophy (especially the limits of language and logic) than the hard engineering that these researchers are trained in.