| HN Mirror

by HarHarVeryFunny 654 days ago

Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.

Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.
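A minimal sketch of the "change the loss, change what is learned" point, assuming a toy setting where the "model" is a single constant fitted to three numbers. The data and the candidate grid are made up for illustration. Under squared error the best constant lands near the mean; under absolute error it lands on the median.

```python
# Toy data and the candidate grid are assumptions chosen for illustration.
data = [0.0, 0.0, 10.0]
candidates = [i / 100 for i in range(0, 1001)]  # 0.00, 0.01, ..., 10.00

def squared_error(c):
    """Total squared error of predicting the constant c for every point."""
    return sum((x - c) ** 2 for x in data)

def absolute_error(c):
    """Total absolute error of predicting the constant c for every point."""
    return sum(abs(x - c) for x in data)

# Same data, same "architecture" (one constant), different loss functions.
best_under_squared = min(candidates, key=squared_error)    # near the mean, 10/3
best_under_absolute = min(candidates, key=absolute_error)  # the median, 0.0

print(best_under_squared, best_under_absolute)
```

The two fitted values differ only because the loss differs, which is the narrow claim being illustrated here; it says nothing by itself about what a large model can or cannot learn.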

In the case of an LLM trained with a next-word prediction loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, given the model architecture being trained) in order to do this.

Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper for example, but there is only so much the model can do. Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board. Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical positions... This is basically cargo-cult chess! It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process when out of training set applies to all inputs, not just chess...

famouswaffles 654 days ago

>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

You clearly do not really understand what it means to predict internet-scale text with increasing accuracy. No more than that? Fantastic.

LLMs do not just learn surface statistics. So many papers have thoroughly debunked this that I'm just not going to bother. This is just straight up denial.

This has been clearly shown in chess as well: https://arxiv.org/abs/2403.15498v2

You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to get a justification.

>It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs)

When you say this for everything LLMs can do then it just becomes a meaningless cope statement.

HarHarVeryFunny 654 days ago

No of course not - they also learn whatever is necessary, and possible, in order to replicate those surface statistics (e.g. understanding of fairy tales, etc, as I noted).

However, you seem to be engaged in magical thinking and believe these models are learning things beyond their architectural limits. You appear to be star struck by what these models can do, and blind to what one can deduce - and SEE - that they are unable to do.

famouswaffles 654 days ago

You've said a lot of things about LLM chess performance that are not true and can easily be shown to be not true. There is literally evidence right there that shows the model learning the board state, rules, player skills, etc.

And then you've tried to paper over being shown that with a conveniently vague and nonsensical, "says more about bla bla bla". No, you were wrong. Your model about this is wrong. It's that simple.

You start from your conclusions and work your way back from them. "Pattern matching technique"? Please. By all means, explain to all of us what this actually entails in a way we can test for it. Not just vague words.

HarHarVeryFunny 654 days ago

An LLM will learn what it CAN (and needs to, to reduce the loss), but not what it CAN'T. How difficult is that to understand?!

Tracking probable board state given a sequence of moves (which don't even need to go all the way back to the start of the game!) is relatively simple to do, and doesn't require hundreds of sequential steps that are beyond the architecture of the model. It's just a matter of incrementally updating the current board state "hypothesis" per each new move (essentially: "a knight just moved to square X, so it must have moved away from some square a knight's move away from X that we believe currently contains a knight").
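A minimal sketch of that incremental update, assuming a hypothetical board representation (a dict from 0-based `(file, rank)` tuples to piece letters) and handling only knight moves. The names `knight_sources` and `apply_knight_move` are invented for illustration; this is not how any particular model represents the board internally.

```python
# The eight (file, rank) offsets a knight can jump.
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_sources(board, target, piece):
    """Squares a knight's move away from `target` that we believe hold `piece`."""
    tf, tr = target
    sources = []
    for df, dr in KNIGHT_OFFSETS:
        sq = (tf + df, tr + dr)
        if 0 <= sq[0] < 8 and 0 <= sq[1] < 8 and board.get(sq) == piece:
            sources.append(sq)
    return sources

def apply_knight_move(board, target, piece):
    """Return an updated board hypothesis after `piece` (a knight) lands on `target`."""
    sources = knight_sources(board, target, piece)
    if len(sources) != 1:
        # Zero or several candidates: real notation would disambiguate; this sketch gives up.
        raise ValueError(f"cannot resolve source square, candidates: {sources}")
    new_board = dict(board)       # leave the previous hypothesis untouched
    del new_board[sources[0]]     # the knight left its old square
    new_board[target] = piece     # and now occupies the target (capturing if occupied)
    return new_board

# White knights on g1 = (6, 0) and b1 = (1, 0); the move "Nf3" targets f3 = (5, 2).
board = {(6, 0): "N", (1, 0): "N"}
board_after = apply_knight_move(board, (5, 2), "N")
print(board_after)
```

Each update touches only the previous hypothesis and the new move, so the work per move is a constant-size lookup rather than a long chain of dependent steps.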

Ditto for estimating player ELO rating in order to predict appropriately good or bad moves. It's basically just a matter of how often the player makes the same move as other players of a given ELO rating in the training data. No need for hundreds of steps of sequential computation that are beyond the architecture of the model.
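A rough sketch of that frequency-matching idea, assuming we already have, for each rating bucket, the fraction of times players in that bucket chose each move in each position. The table `move_probs`, the position keys, and the function name are all hypothetical toy values; the estimate is simply the bucket under which the observed moves are most likely.

```python
import math

def estimate_rating_bucket(observed, move_probs, floor=1e-6):
    """Pick the rating bucket whose move frequencies best explain the observed moves.

    observed:   list of (position_key, move) pairs played by the player so far
    move_probs: {bucket: {(position_key, move): frequency in training data}}
    floor:      small probability used for moves never seen in a bucket
    """
    best_bucket, best_log_likelihood = None, -math.inf
    for bucket, probs in move_probs.items():
        log_likelihood = sum(math.log(probs.get(pm, floor)) for pm in observed)
        if log_likelihood > best_log_likelihood:
            best_bucket, best_log_likelihood = bucket, log_likelihood
    return best_bucket

# Toy frequencies (made up): in position "p1", weaker players prefer move "a".
move_probs = {
    1200: {("p1", "a"): 0.7, ("p1", "b"): 0.3},
    2000: {("p1", "a"): 0.2, ("p1", "b"): 0.8},
}
estimated = estimate_rating_bucket([("p1", "b"), ("p1", "b")], move_probs)
print(estimated)
```

The whole computation is a running sum of per-move scores, so it needs no deep sequential search.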

Doing an N-ply lookahead to reason about potential moves is a different story, but you want to ignore that and instead throw out a straw man "counter argument" about maintaining board state as if that somehow proves that the LLM can magically apply > N=layers of sequential reasoning to derive moves. Sorry, but this is precisely magical faith-based thinking "it can do X, so it can do Y" without any analysis of what it takes to do X and Y and why one is possible, and the other is not.
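For contrast, a minimal sketch of N-ply lookahead as plain minimax over a hand-made toy tree (the tree, its scores, and the dict layout are assumptions for illustration). The value at depth N depends on values at depth N-1, which depend on depth N-2, and so on, so the steps are inherently sequential.

```python
def minimax(node, depth, maximizing):
    """Value of `node` searching `depth` plies ahead; falls back to the static score."""
    children = node.get("children", [])
    if depth == 0 or not children:
        return node["score"]  # static evaluation, no further lookahead
    values = [minimax(child, depth - 1, not maximizing) for child in children]
    return max(values) if maximizing else min(values)

# Toy position: move A looks best statically (score 5) but has a strong reply (-10).
tree = {
    "score": 0,
    "children": [
        {"score": 5, "children": [{"score": -10}]},  # move A and its refutation
        {"score": 1, "children": [{"score": 2}]},    # move B, quietly fine
    ],
}

one_ply = minimax(tree, 1, True)  # judges A and B by their static scores only
two_ply = minimax(tree, 2, True)  # also considers the opponent's best reply
print(one_ply, two_ply)
```

With one ply the search trusts move A's static score; with two plies it sees the refutation and prefers move B. Whether a fixed-depth network can approximate this kind of search well enough is exactly what is being argued in this thread.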

famouswaffles 654 days ago

>An LLM will learn what it CAN (and needs to to reduce the loss), but not what it CAN'T. How difficult is that to understand?!

Right and the point is that you don't know what it CAN'T learn. You clearly don't quite understand this because you say stuff like this:

>Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it

and it's just baffling because:

1. Humans don't play chess anything like chess engines. They literally can't because the brain has iterative computation limits well below that of a computer. Most Grandmasters are only evaluating 5 to 6 moves deep on average.

2. We have a chess transformer playing world class chess (grandmaster level) - https://arxiv.org/abs/2402.04494.

You keep trying to make the point that because a Transformer architecturally has a depth limit for any given trained model, it cannot reach human level.

But this is nonsensical for various reasons.

- Nobody is stopping you from just increasing N such that every GI problem we care about is covered.

- You have shown literally no evidence that the N even state-of-the-art models possess today is insufficient to match human iterative compute ability.

GPT-4o instant shots arbitrary arithmetic more accurately than any human brain and that's supposedly something it's bad at. You can clearly see it can reach world class chess play.

If you have some iterative computation benchmark that shows transformers zero shotting worse than an unaided human then feel free to show me.

HarHarVeryFunny 654 days ago

OK - you win. Today's LLMs are just as good as humans at reasoning.

Why don't you write to Sam Altman to tell him the good news?

Tell him there's nothing stopping him from "increasing N" until the thing gets up and walks out the door.
