by famouswaffles 657 days ago
It's not sloppy. It just doesn't matter in the limit of training.

1. An Octopus and a Raven have wildly different brains. Both are intelligent. So just the idea that there is some "one true system" that the NN must discover or converge on is suspect. Even basic arithmetic has numerous methods.

2. In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust. What gets the job done gets the job done. There is no way an increasingly competent predictor will not learn representations of the concepts in text, whether that looks like how humans do it or not.


> whether that looks like how humans do it or not.

So you agree with me that there is no guarantee it learns any representation of the actual process that produced the training data.

Sure, I agree. But if that's what you're getting hung up on, I think you've missed his point entirely.

Whether the machine becomes a human brain clone or something entirely alien is irrelevant. The point is, you can't cheat reality. Statistics is not magic. You can't predict text that understands without understanding.

Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.

Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

In the case of an LLM trained with a predict next word loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, per the model architecture being trained) in order to do this.
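To make "surface statistics under a next-word loss" concrete, here is a minimal sketch, not how any real LLM is trained. It assumes a toy bigram counter as the "model" and a made-up ten-word corpus; `train_bigram`, `next_token_loss` and the smoothing constant `alpha` are hypothetical names chosen for illustration. The loss is the usual average cross-entropy of the next token given the previous one.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_loss(counts, tokens, vocab_size, alpha=1.0):
    """Average cross-entropy (in nats) of predicting each next token.

    Add-alpha smoothing gives unseen pairs a non-zero probability.
    """
    total = 0.0
    n = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = counts.get(prev, Counter())
        p = (row[nxt] + alpha) / (sum(row.values()) + alpha * vocab_size)
        total += -math.log(p)
        n += 1
    return total / n

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the cat ate".split()
vocab = sorted(set(corpus))

bigram = train_bigram(corpus)
loss = next_token_loss(bigram, corpus, len(vocab))
uniform_loss = math.log(len(vocab))  # loss of guessing uniformly at random

print(f"bigram loss: {loss:.3f}  uniform baseline: {uniform_loss:.3f}")
```

The counter beats the uniform baseline purely by memorising co-occurrence frequencies. Whether a much larger model trained on the same objective is doing something qualitatively different is exactly what is being argued here; the sketch only shows what the objective itself measures.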

Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper, for example - but there is only so much the model can do.

Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board.

Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical positions... This is basically cargo-cult chess!

It's interesting that LLMs can reach the Elo level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process applies to all out-of-training-set inputs, not just chess...

>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

You clearly do not really understand what it means to predict internet scale text with increasing accuracy. No more than that? Fantastic.

LLMs do not just learn surface statistics. So many papers have thoroughly disproven this that I'm just not going to bother. This is just straight up denial.

This has been clearly shown in chess as well. https://arxiv.org/abs/2403.15498v2

You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to get a justification.

>It's interesting that LLMs can reach the Elo level that they do (says more about chess than about LLMs)

When you say this for everything LLMs can do then it just becomes a meaningless cope statement.

No, of course not - they also learn whatever is necessary, and possible, in order to replicate those surface statistics (e.g. understanding of fairy tales, etc, as I noted).

However, you seem to be engaged in magical thinking and believe these models are learning things beyond their architectural limits. You appear to be star struck by what these models can do, and blind to what one can deduce - and SEE - that they are unable to do.

You've said a lot of things about LLM chess performance that are not true and can easily be shown to be untrue. There is literally evidence right there showing the model learning the board state, rules, player skills, etc.

And then you've tried to paper over being shown that with a conveniently vague and nonsensical, "says more about bla bla bla". No, you were wrong. Your model about this is wrong. It's that simple.

You start from your conclusions and work backwards from them. "Pattern matching technique"? Please. By all means, explain to all of us what this actually entails in a way we can test for it. Not just vague words.

No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

Architecture matters. The best you can hope for from an LLM is that training will converge on the best LLM generating process it can be, which can be great for in-distribution prediction, but lousy for novel reasoning tasks beyond the capability of the architecture.

>No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

Go back a few evolutionary steps and sure you can. Most ANN architectures have few to no biases baked in, and the Transformer might be the most blank slate we've built yet.

>No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

A transformer can perform any computation it likes in a forward pass and you can arbitrarily increase inference compute time with the token length. Feedback paths? Sure. Compute inefficient? Perhaps. Some extra programming around the model to facilitate this? Maybe, but the architecture certainly isn't stopping you.

Even if it couldn't, limited =/ trivial. The Human Brain is not Turing complete.

Internal memory? Did you miss the memo? Recurrence is overrated. Attention is all you need.

That said, there are already state keeping language model architectures around.

Altering weights? Can a transformer continuously train? Sure. It's not really compute efficient, but the architecture certainly doesn't prohibit it.

>Architecture matters

Compute efficiency? Sure. What it is capable of learning? Not so much.

> A transformer can perform any computation it likes in a forward pass

No it can't.

A transformer has a fixed number of layers - call it N. It performs N sequential steps of computation to derive its output.

If a computation requires > N steps, then a transformer most certainly can not perform it in a forward pass.
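A toy sketch of that depth bound, under loud assumptions: each "layer" is allowed exactly one step of a computation that is treated as having no shortcut (a Collatz update, picked arbitrarily), and `forward_pass` is a hypothetical stand-in, not a real transformer. Real layers do far more work per step, so this only illustrates the counting argument.

```python
def step(x):
    """One unit of sequential work (a Collatz update, chosen arbitrarily)."""
    return x // 2 if x % 2 == 0 else 3 * x + 1

def steps_needed(x):
    """Number of sequential steps required to drive x down to 1."""
    n = 0
    while x != 1:
        x = step(x)
        n += 1
    return n

def forward_pass(x, n_layers):
    """Toy fixed-depth model: at most one `step` per layer, then stop."""
    for _ in range(n_layers):
        if x == 1:  # computation already finished
            break
        x = step(x)
    return x

print(steps_needed(6))     # how many steps the task needs
print(forward_pass(6, 4))  # too shallow: stops partway through
print(forward_pass(6, 8))  # deep enough: finishes
```

With a starting value of 6 the task needs 8 steps, so a depth-4 pass stops at an intermediate value while a depth-8 pass finishes.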

FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model". Attention is not all you need if what you actually want to build is a cognitive architecture.

Transformers produce the next token by manipulating K hidden vectors per layer, one vector per preceding token. So yes, you can increase compute length arbitrarily by increasing tokens. Those tokens don't have to carry any information to work.

https://arxiv.org/abs/2310.02226

And again, human brains are clearly limited in the number of steps they can compute without writing something down. Limited =/ Trivial

>FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model".

Great. Do you know what a "language model" is capable of in the limit? No.

These top research labs aren't only working on Transformers as they currently exist but it doesn't make much sense to abandon a golden goose before it has hit a wall.

> And again, human brains are clearly limited in the number of steps they can compute without writing something down

No - there is a loop between the cortex and thalamus, feeding the outputs of the cortex back in as inputs. Our brain can iterate for as long as it likes before initiating any motor output, if any, such as writing something down.

The brain's ability to iterate on information is still constrained by certain cognitive limitations like working memory capacity and attention span.

In practice, the cortex-thalamus loop allows for some degree of internal iteration, but the brain cannot endlessly iterate without some form of external aid (e.g., writing something down) to offload information and prevent cognitive overload.

I'm not telling you anything here you don't experience in your everyday life. Try indefinitely iterating on any computation you like and see how well that works for you.

You seem to repeatedly insist that hidden computation is a distinction of any relevance whatsoever.

First of all, your understanding of the architecture itself is mistaken. A transformer can iterate endlessly because each token it produces allows it a forward pass, and each of these tokens is appended to its input in the next inference. That's the autoregressive in autoregressive transformer, and the entire reason why it was proposed for arbitrary seq2seq transduction.

This means you get layers * tokens iterations, where tokens is up to two million, and is in practice unlimited due to the LLM being able to summarize and select from that. Parallelism is irrelevant, since the transformer is sequential in the output of tokens. A transformer can iterate endlessly, it simply has to output enough tokens.
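A toy sketch of the layers * tokens counting, with the key assumption stated up front: the emitted "token" is assumed to carry the full intermediate state into the next pass. Whether real tokens (plus attention over earlier positions) carry enough is a separate question. `step`, `forward_pass` and `generate` are hypothetical names, and the Collatz update is just an arbitrary stand-in for one unit of sequential work.

```python
def step(x):
    """One unit of sequential work (a Collatz update, chosen arbitrarily)."""
    return x // 2 if x % 2 == 0 else 3 * x + 1

def forward_pass(x, n_layers):
    """Toy fixed-depth model: exactly one `step` per layer."""
    for _ in range(n_layers):
        x = step(x)
    return x

def generate(x, n_layers, n_tokens):
    """Autoregressive loop: each output is fed back in as the next input."""
    trace = []
    for _ in range(n_tokens):
        x = forward_pass(x, n_layers)
        trace.append(x)  # assumption: the token encodes the whole state
    return trace

# Two passes of depth 4 do the same work as one pass of depth 8.
trace = generate(6, n_layers=4, n_tokens=2)
print(trace, forward_pass(6, 8))
```

Under that assumption the total sequential work is n_layers * n_tokens, which is the claim above. If the token were a lossy summary of the state, the equivalence would no longer hold in general.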

And no, the throughput isn't limited either, since each token gets translated into a high-dimensional internal representation, that in turn is influenced by each other token in the model input. Models can encode whatever they want not just by choosing a token, but by choosing an arbitrary pattern of tokens encoding arbitrary latent-space interactions.

Secondly, internal thoughts are irrelevant, because something being "internal" is an arbitrary distinction without impact. If I trained an LLM to prepend and append <internal_thought> to some part of its output, and then simply didn't show that part, then the LLM wouldn't magically become human. This is something many models do even today, in fact.

Similarly, if I were to take a human and modify their brain to only be able to iterate using pen and paper, or by speaking out loud, then I wouldn't magically make them into something non-human. And I would definitely not reduce their capacity for reasoning in any way whatsoever. There are people with aphantasia working in the arts, there are people without an internal monologue working as authors - how "internal" something is can be trivially changed with no influence on either the architecture or the capabilities of that architecture.

Reasoning itself isn't some unified process, neither is it infinite iteration. It requires specific understanding about the domain being reasoned over, especially understanding of which transformation rules are applicable to produce desired states, where the judgement about which states are desirable has to be learned itself. LLMs can reason today, they're just not as good at it as humans are in some domains.

You are confusing number of sequential steps with total amount of compute spent.

The input sequence is processed in parallel, regardless of length, so the number of tokens has no impact on the number of sequential compute steps, which is always N = layers.
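A rough way to see the two quantities side by side, as a sketch under simplifying assumptions: total work is approximated by the number of query-key pairs scored under a causal mask, and everything else (MLPs, heads, widths) is ignored. Both function names and the 12-layer figure are hypothetical.

```python
def sequential_depth(n_layers, n_tokens):
    """Layers applied one after another in a single forward pass.

    Deliberately ignores n_tokens: every position passes through
    each layer in parallel.
    """
    return n_layers

def attention_pairs(n_layers, n_tokens):
    """Query-key pairs scored across all layers under a causal mask."""
    return n_layers * n_tokens * (n_tokens + 1) // 2

for t in (10, 100, 1000):
    print(t, sequential_depth(12, t), attention_pairs(12, t))
```

Total work grows roughly quadratically with the number of tokens while the depth of a single pass stays fixed. This only counts one forward pass over a given input; it says nothing about what happens when outputs are fed back in.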

> Do you know what a "language model" is capable of in the limit ?

Well, yeah, if the language model is an N-layer transformer ...

Fair enough.

Then increase N (N is almost always increased when a model is scaled up) and train, or write things down and continue.

A limitless iteration machine (without external aid) is currently an idea of fiction. Brains can't do it so I'm not particularly worried if machines can't either.

How about spider intelligence? They don't even have a brain.

> In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust.

This is just moving the goalposts from "learning the actual process" to "any process sufficiently robust".

I didn't move anything, because last I checked the term was Artificial Intelligence, not "Artificial exactly-as-a-human-does Intelligence".