by dontlikeyoueith 656 days ago
> But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text

This is pretty sloppy thinking.

The neural network learns some representation of a process that COULD HAVE produced the text. (this isn't some bold assertion, it's just the literal definition of a statistical model).

There is no guarantee it is the same as the actual process. A lot of the "bow down before machine God" crowd is guilty of this same sloppy confusion.
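
To make the "COULD HAVE produced" point concrete, here is a minimal toy sketch. Everything in it is hypothetical (the two processes and the `fit_frequencies` helper are made up for illustration, not taken from any paper): two mechanically different generators emit the same distribution, so a model fit only to their outputs cannot tell which one actually ran.

```python
import random
from collections import Counter

def process_a(rng):
    # Hypothetical process A: a single fair coin flip.
    return rng.choice("HT")

def process_b(rng):
    # Hypothetical process B: XOR of two independent fair bits.
    # Different mechanism, same output distribution as process A.
    return "H" if rng.getrandbits(1) ^ rng.getrandbits(1) else "T"

def fit_frequencies(samples):
    # The simplest possible "statistical model": empirical symbol frequencies.
    counts = Counter(samples)
    total = sum(counts.values())
    return {sym: counts[sym] / total for sym in sorted(counts)}

rng = random.Random(0)
model_a = fit_frequencies(process_a(rng) for _ in range(100_000))
model_b = fit_frequencies(process_b(rng) for _ in range(100_000))
print(model_a, model_b)
```

Both fitted models land near 50/50, and nothing in either one records whether the data came from one flip or from an XOR of two.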

3 comments

It's not sloppy. It just doesn't matter in the limit of training.

1. An Octopus and a Raven have wildly different brains. Both are intelligent. So just the idea that there is some "one true system" that the NN must discover or converge on is suspect. Even basic arithmetic has numerous methods.

2. In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust. What gets the job done gets the job done. There is no way an increasingly competent predictor will not learn representations of the concepts in text, whether that looks like how humans do it or not.

> whether that looks like how humans do it or not.

So you agree with me that there is no guarantee it learns any representation of the actual process that produced the training data.

Sure, I agree. But if that's what you're getting hung up on, I think you've missed his point entirely.

Whether the machine becomes a human brain clone or something entirely alien is irrelevant. The point is, you can't cheat reality. Statistics is not magic. You can't predict text that understands without understanding.

Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.

Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.
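
A minimal sketch of "change the loss function and you change what the model learns", using the smallest model I can think of: a single constant fit to some data. The data, the candidate grid, and the `best_constant` helper are all hypothetical, chosen only to illustrate the point.

```python
def best_constant(data, loss, candidates):
    # Brute-force "training": pick the constant prediction with the lowest total loss.
    return min(candidates, key=lambda c: sum(loss(c, x) for x in data))

data = [1, 2, 3, 4, 100]      # made-up data with one outlier
candidates = range(0, 101)    # integer grid of possible predictions

# Squared error pulls the fit toward the mean; absolute error toward the median.
mse_fit = best_constant(data, lambda c, x: (c - x) ** 2, candidates)
mae_fit = best_constant(data, lambda c, x: abs(c - x), candidates)
print(mse_fit, mae_fit)
```

Same data, same "architecture", and the two losses settle on different answers (the mean versus the median).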

In the case of an LLM trained with a predict next word loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, per the model architecture being trained) in order to do this.

Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper for example, but there is only so much the model can do. Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board. Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical positions... This is basically cargo-cult chess! It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process when out of training set applies to all inputs, not just chess...

>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

You clearly do not really understand what it means to predict internet scale text with increasing accuracy. No more than that? Fantastic.

LLMs do not just learn surface statistics. So many papers have thoroughly debunked this that I'm just not going to bother. This is just straight up denial.

This has been shown in chess as well. https://arxiv.org/abs/2403.15498v2

You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to get a justification.

>It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs)

When you say this for everything LLMs can do then it just becomes a meaningless cope statement.

No of course not - they also learn whatever is necessary, and possible, in order to replicate those surface statistics (e.g. understanding of fairy tales, etc, as I noted).

However, you seem to be engaged in magical thinking and believe these models are learning things beyond their architectural limits. You appear to be star struck by what these models can do, and blind to what one can deduce - and SEE - that they are unable to do.

No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

Architecture matters. The best you can hope for an LLM is that training will converge on the best LLM generating process it can be, which can be great for in-distribution prediction, but lousy for novel reasoning tasks beyond the capability of the architecture.

>No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

Go back a few evolutionary steps and sure you can. Most ANN architectures basically have relatively little to no biases baked in and the Transformer might be the most blank slate we've built yet.

>No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

A transformer can perform any computation it likes in a forward pass and you can arbitrarily increase inference compute time with the token length. Feedback paths? Sure. Compute inefficient? Perhaps. Some extra programming around the model to facilitate this? Maybe, but the architecture certainly isn't stopping you.

Even if it couldn't, limited =/ trivial. The Human Brain is not Turing complete.

Internal memory? Did you miss the memo? Recurrence is overrated. Attention is all you need.

That said, there are already state keeping language model architectures around.

Altering weights? Can a transformer continuously train? Sure. It's not really compute efficient but architecture certainly doesn't prohibit it.

>Architecture matters

Compute Efficiency? Sure. What it is capable of learning? Not so much

> A transformer can perform any computation it likes in a forward pass

No it can't.

A transformer has a fixed number of layers - call it N. It performs N sequential steps of computation to derive its output.

If a computation requires > N steps, then a transformer most certainly cannot perform it in a forward pass.
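
A toy sketch of the depth argument. It assumes a "layer" can do exactly one step of some inherently sequential computation. `step` and `forward_pass` are hypothetical stand-ins, not a real transformer, and the sketch only covers a single pass, not what happens once outputs are fed back in as new tokens.

```python
def step(x):
    # One sequential step of a hypothetical iterated computation (Collatz-style update).
    return x // 2 if x % 2 == 0 else 3 * x + 1

def forward_pass(x, n_layers):
    # Toy stand-in for a fixed-depth network: exactly n_layers sequential steps, no more.
    for _ in range(n_layers):
        x = step(x)
    return x

# Starting from 6, this particular computation needs 8 steps to reach 1.
shallow = forward_pass(6, 4)   # N=4: stops partway through
deep = forward_pass(6, 8)      # N=8: enough depth to finish
print(shallow, deep)
```

With N=4 the pass ends at an intermediate value. Only a pass with at least as many steps as the computation needs gets to the answer.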

FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model". Attention is not all you need if what you actually want to build is a cognitive architecture.

Transformers produce the next token by manipulating K hidden vectors per layer, one vector per preceding token. So yes, you can increase compute length arbitrarily by increasing tokens. Those tokens don't have to carry any information to work.

https://arxiv.org/abs/2310.02226

And again, human brains are clearly limited in the number of steps they can compute without writing something down. Limited =/ Trivial

>FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model".

Great. Do you know what a "language model" is capable of in the limit? No.

These top research labs aren't only working on Transformers as they currently exist but it doesn't make much sense to abandon a golden goose before it has hit a wall.

> And again, human brains are clearly limited in the number of steps they can compute without writing something down

No - there is a loop between the cortex and thalamus, feeding the outputs of the cortex back in as inputs. Our brain can iterate for as long as it likes before initiating any motor output, if any, such as writing something down.

You are confusing number of sequential steps with total amount of compute spent.

The input sequence is processed in parallel, regardless of length, so number of tokens has no impact on number of sequential compute steps which is always N=layers.

> Do you know what a "language model" is capable of in the limit?

Well, yeah, if the language model is an N-layer transformer ...

How about spiders' intelligence? They don't even have a brain.

> In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust.

This is just moving the goalposts from "learning the actual process" to "any process sufficiently robust".

I didn't move anything, because last I checked the term was Artificial Intelligence, not Artificial exactly-as-a-human-does Intelligence.

A photograph is not the same as its subject, and it is not sufficient to reconstruct the subject, but it's still a representation of the subject. Even a few sketched lines are something we recognise as a representation of a physical object.

I think it's fair to call one process that can imitate a more complex one a representation of that process. Especially when in the very next sentence he describes it as a "projection", which has the mathematical sense of a representation that loses some dimensions.

> I think it's fair to call one process that can imitate a more complex one a representation of that process

I think it's sloppy.

Yes, exactly. The trick is to have enough tough data so you find the optimal one. I think as we scale models back to smaller sizes, we will discover viable/correct representations.