by dontlikeyoueith 655 days ago
> whether that looks like how humans do it or not.

So you agree with me that there is no guarantee it learns any representation of the actual process that produced the training data.


Sure, I agree. But if that's what you're getting hung up on, I think you've missed his point entirely.

Whether the machine becomes a human brain clone or something entirely alien is irrelevant. The point is, you can't cheat reality. Statistics is not magic. You can't predict text that demonstrates understanding without understanding.

Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.

Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.
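To make the "change the loss function and you change what the model learns" point concrete, here is a minimal sketch with the simplest possible "model": a single constant fitted to a few numbers. The data, the candidate grid and the `best_constant` helper are all made up for illustration. Under squared error the best constant lands near the mean; under absolute error it lands on the median.

```python
import statistics

def best_constant(data, loss):
    """Brute-force the constant in [0, 10] (step 0.01) with the lowest total loss."""
    candidates = [i / 100 for i in range(0, 1001)]
    return min(candidates, key=lambda c: sum(loss(c, y) for y in data))

# Toy data, chosen only so that mean and median differ clearly.
data = [1.0, 2.0, 10.0]

mse_fit = best_constant(data, lambda c, y: (c - y) ** 2)  # squared error
mae_fit = best_constant(data, lambda c, y: abs(c - y))    # absolute error

print("squared error picks", mse_fit, "(mean is", statistics.mean(data), ")")
print("absolute error picks", mae_fit, "(median is", statistics.median(data), ")")
```

Same data, same "architecture", different objective, different thing learned.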

In the case of an LLM trained with a predict next word loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, per the model architecture being trained) in order to do this.
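For reference, a "predict next word" loss is usually the average cross-entropy of the token that actually came next. A minimal sketch, assuming the model's output is already a list of probability distributions over a toy 3-token vocabulary (the `next_token_loss` name and the numbers are illustrative, not taken from any particular library):

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log-probability assigned to the observed next tokens."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Two positions; each row is the model's distribution over the next token.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
targets = [0, 2]  # the tokens that actually followed in the training text

loss = next_token_loss(probs, targets)
print(loss)
```

Nothing in this objective refers to how the text was produced; it only scores the probability placed on what came next.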

Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper, for example - but there is only so much the model can do.

Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf. layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board.

Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical positions... This is basically cargo-cult chess!

It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process when out of training set applies to all inputs, not just chess...

>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

You clearly do not really understand what it means to predict internet-scale text with increasing accuracy. No more than that? Fantastic.

LLMs do not just learn surface statistics. So many papers have thoroughly debunked this that I'm just not going to bother. This is just straight-up denial.

This has been clearly shown in chess as well. https://arxiv.org/abs/2403.15498v2

You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to get a justification.

>It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs)

When you say this about everything LLMs can do, it just becomes a meaningless cope statement.

No of course not - they also learn whatever is necessary, and possible, in order to replicate those surface statistics (e.g. understanding of fairy tales, etc, as I noted).

However, you seem to be engaged in magical thinking and believe these models are learning things beyond their architectural limits. You appear to be starstruck by what these models can do, and blind to what one can deduce - and SEE - that they are unable to do.

You've said a lot of things about LLM chess performance that are not true and can easily be shown to be not true. There is literally evidence right there showing the model learning the board state, rules, player skills etc.

And then you've tried to paper over being shown that with a conveniently vague and nonsensical, "says more about bla bla bla". No, you were wrong. Your model about this is wrong. It's that simple.

You start from your conclusions and work your way back from them. "pattern matching technique"? Please. By all means, explain to all of us what this actually entails in a way we can test for it. Not just vague words.

An LLM will learn what it CAN (and needs to, to reduce the loss), but not what it CAN'T. How difficult is that to understand?!

Tracking probable board state given a sequence of moves (which don't even need to go all the way back to the start of the game!) is relatively simple to do, and doesn't require hundreds of sequential steps that are beyond the architecture of the model. It's just a matter of incrementally updating the current board state "hypothesis" per each new move (essentially: "a knight just moved to square X, so it must have moved away from some square a knight's move away from X that we believe currently contains a knight").
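A rough sketch of that incremental update for the knight case, to show how little sequential work one step needs. The board is just a dict from square to piece letter, and `knight_sources` and `apply_knight_move` are hypothetical helpers written for this comment; real move notation also carries disambiguation, captures and side to move, all of which are ignored here.

```python
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_sources(target):
    """Squares that are one knight's move away from `target` (e.g. 'f3')."""
    file, rank = ord(target[0]) - ord("a"), int(target[1]) - 1
    squares = []
    for df, dr in KNIGHT_OFFSETS:
        f, r = file + df, rank + dr
        if 0 <= f < 8 and 0 <= r < 8:
            squares.append(chr(ord("a") + f) + str(r + 1))
    return squares

def apply_knight_move(board, target, piece="N"):
    """Return an updated board hypothesis after `piece` lands on `target`."""
    candidates = [sq for sq in knight_sources(target) if board.get(sq) == piece]
    if not candidates:
        return board  # hypothesis is inconsistent with the move; leave it unchanged
    new_board = dict(board)
    del new_board[candidates[0]]  # simplification: take the first matching knight
    new_board[target] = piece
    return new_board

# Partial hypothesis: we only believe these three squares are occupied.
board = {"g1": "N", "b1": "N", "e2": "P"}
updated = apply_knight_move(board, "f3")
print(updated)
```

Each move is one constant-size lookup and update on the previous hypothesis, with no search involved.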

Ditto for estimating player ELO rating in order to predict appropriately good or bad moves. It's basically just a matter of how often the player makes the same move as other players of a given ELO rating in the training data. No need for hundreds of steps of sequential computation that are beyond the architecture of the model.
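The rating estimate can be sketched the same way: score each rating bucket by how likely players in that bucket are to make the moves seen so far, and pick the best-scoring bucket. Everything below is invented for illustration (the `move_probs` table, the position labels, the two buckets); it is an assumption about what such a statistic could look like, not a description of what any model computes internally.

```python
import math

def estimate_rating_bucket(observed, move_probs, floor=1e-6):
    """Pick the rating bucket whose players most often make the observed moves.

    observed   : list of (position, move) pairs seen so far in the game
    move_probs : bucket -> {(position, move): frequency in the training data}
    floor      : probability used for pairs never seen in a bucket
    """
    scores = {}
    for bucket, table in move_probs.items():
        scores[bucket] = sum(math.log(table.get(pm, floor)) for pm in observed)
    return max(scores, key=scores.get)

# Hypothetical frequencies: how often each bucket plays each move in two positions.
move_probs = {
    1200: {("p1", "a"): 0.6, ("p1", "b"): 0.4, ("p2", "c"): 0.7, ("p2", "d"): 0.3},
    2200: {("p1", "a"): 0.1, ("p1", "b"): 0.9, ("p2", "c"): 0.2, ("p2", "d"): 0.8},
}

strong_game = [("p1", "b"), ("p2", "d")]
weak_game = [("p1", "a"), ("p2", "c")]
print(estimate_rating_bucket(strong_game, move_probs))
print(estimate_rating_bucket(weak_game, move_probs))
```

This is a running sum over the moves so far, so again there is no deep chain of dependent steps.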

Doing an N-ply lookahead to reason about potential moves is a different story, but you want to ignore that and instead throw out a straw man "counter argument" about maintaining board state, as if that somehow proves that the LLM can magically apply more sequential reasoning steps than it has layers to derive moves. Sorry, but this is precisely magical faith-based thinking: "it can do X, so it can do Y", without any analysis of what it takes to do X and Y and why one is possible, and the other is not.
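To illustrate why lookahead is a different kind of computation, here is a depth-limited search on a toy game rather than chess: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The `negamax` function and the game are stand-ins chosen for brevity, and nothing here is a claim about what a transformer does internally. The point is only that the answer for a position can stay "unknown" until the search is allowed enough plies.

```python
def negamax(stones, depth):
    """Value for the player to move: +1 forced win, -1 forced loss, 0 unknown at this depth."""
    if stones == 0:
        return -1  # the opponent just took the last stone
    if depth == 0:
        return 0   # search budget exhausted: no verdict
    return max(-negamax(stones - take, depth - 1)
               for take in (1, 2, 3) if take <= stones)

# 8 stones is a lost position for the player to move, but proving it takes 4 plies.
for depth in (1, 2, 3, 4):
    print(depth, negamax(8, depth))
```

With 3 plies the search still returns 0 for 8 stones; only at 4 plies does it return -1. Each extra ply is a step that depends on the result of the previous one, which is the kind of depth being argued about above.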