| HN Mirror


by famouswaffles 656 days ago

Transformers produce the next token by manipulating K hidden vectors per layer, one vector per preceding token. So yes, you can increase compute length arbitrarily by increasing the number of tokens. Those tokens don't have to carry any information to work.

https://arxiv.org/abs/2310.02226

And again, human brains are clearly limited in the number of steps they can compute without writing something down. Limited ≠ trivial.

>FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model".

Great. Do you know what a "language model" is capable of in the limit? No.

These top research labs aren't only working on Transformers as they currently exist but it doesn't make much sense to abandon a golden goose before it has hit a wall.


HarHarVeryFunny 656 days ago

> And again, human brains are clearly limited in the number of steps they can compute without writing something down

No - there is a loop between the cortex and thalamus, feeding the outputs of the cortex back in as inputs. Our brain can iterate for as long as it likes before initiating any motor output, if any, such as writing something down.

famouswaffles 656 days ago

The brain's ability to iterate on information is still constrained by certain cognitive limitations like working memory capacity and attention span.

In practice, the cortex-thalamus loop allows for some degree of internal iteration, but the brain cannot endlessly iterate without some form of external aid (e.g., writing something down) to offload information and prevent cognitive overload.

I'm not telling you anything here you don't experience in your everyday life. Try indefinitely iterating on any computation you like and see how well that works for you.

HarHarVeryFunny 656 days ago

What's your point?

The discussion is about the architecturally imposed limitations of LLMs, resulting in capabilities that are way less than that of a brain.

The fact that the brain has its own limits doesn't somehow negate this fact!

famouswaffles 656 days ago

My point is that for some bizarre reason, people have standards of reasoning (for machines) that only exist in fiction or their own imagination.

It is beyond silly to dump an architecture for a limitation the human brain also has. A reasoning engine that can iterate indefinitely with no external aid does not exist in real life. That the transformer shares this weakness is no reason to conclude its capabilities must be less than a brain's, so the point is completely moot.

HarHarVeryFunny 655 days ago

LLMs are here to stay until something better replaces them, and will be used for those things they are capable of.

It shouldn't be surprising they are not great at reasoning, or everything one would hope for from an AGI, since they simply were not built for that. If you look at the development history, the transformer was a successor to LSTM-based seq-2-seq models using Bahdanau attention, whose main goal was to more efficiently utilize parallel hardware by supporting parallel processing. Of course a good language model (word predictor) will look as if it's reasoning because it is trying to model the data it was trained on - a human reasoner.

As humans we routinely think for seconds/minutes or even hours before speaking or acting, while an LLM only has that fixed N steps (layers) of computation. I don't know why you claim this difference (among others) should make no difference, but it clearly does, with out-of-training-set reasoning weakness being a notable limitation that people such as Demis Hassabis have recently conceded.

famouswaffles 654 days ago

Reasoning is reasoning. "Look as if it is reasoning" is an imaginary distinction you've made up. One that is very clear because everybody touting this "fake reasoning" rhetoric is still somehow unable to define a testable version of reasoning that disqualifies LLMs without also disqualifying some chunk of humans.

>As humans we routinely think for seconds/minutes or even hours before speaking or acting

No human is iterating on a base thought for hours uninterrupted lol so this is just moot

>with out-of-training-set reasoning weakness being a notable limitation that people such as Demis Hassabis have recently conceded.

Humans also reason more weakly outside their training distribution. LLMs are simply worse at it, for now.

accountnum 655 days ago

You seem to repeatedly insist that hidden computation is a distinction of any relevance whatsoever.

First of all, your understanding of the architecture itself is mistaken. A transformer can iterate endlessly because each token it produces gives it another forward pass, and each of these tokens is appended to its input for the next inference step. That's the "autoregressive" in "autoregressive transformer", and the entire reason why it was proposed for arbitrary seq2seq transduction.

This means you get layers * tokens iterations, where tokens is up to two million, and is in practice unlimited due to the LLM being able to summarize and select from that. Parallelism is irrelevant, since the transformer is sequential in the output of tokens. A transformer can iterate endlessly, it simply has to output enough tokens.
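
To make the layers * tokens count concrete, here is a minimal sketch of that loop. It is a toy stand-in, not a real transformer: `NUM_LAYERS`, `forward_pass`, and `generate` are hypothetical names, and the arithmetic inside the "layers" is made up. The only thing it is meant to show is that every generated token is fed back in and buys another N sequential layer steps.

```python
# Toy stand-in for an autoregressive model; nothing here is a real transformer.
NUM_LAYERS = 4  # hypothetical depth N


def forward_pass(tokens):
    """Run one forward pass; return (next_token, sequential layer steps used)."""
    state = sum(tokens)
    steps = 0
    for _ in range(NUM_LAYERS):  # layers run strictly one after another
        state = (state * 31 + 7) % 101  # arbitrary made-up "layer" update
        steps += 1
    return state % 10, steps


def generate(prompt, new_tokens):
    """Append each produced token to the input and run the model again."""
    tokens = list(prompt)
    total_steps = 0
    for _ in range(new_tokens):
        next_token, steps = forward_pass(tokens)
        tokens.append(next_token)  # the output becomes part of the next input
        total_steps += steps
    return tokens, total_steps


tokens, total_steps = generate([1, 2, 3], new_tokens=5)
print(len(tokens), total_steps)  # 8 tokens in context, 5 * 4 = 20 sequential steps
```

Under these assumptions the sequential step count is simply `NUM_LAYERS * new_tokens`, so it grows for as long as the model keeps emitting tokens.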

And no, the throughput isn't limited either, since each token gets translated into a high-dimensional internal representation, that in turn is influenced by each other token in the model input. Models can encode whatever they want not just by choosing a token, but by choosing an arbitrary pattern of tokens encoding arbitrary latent-space interactions.

Secondly, internal thoughts are irrelevant, because something being "internal" is an arbitrary distinction without impact. If I trained an LLM to wrap some part of its output in <internal_thought> tags, and then simply didn't show that part, then the LLM wouldn't magically become human. This is something many models do even today, in fact.

Similarly, if I were to take a human and modify their brain to only be able to iterate using pen and paper, or by speaking out loud, then I wouldn't magically make them into something non-human. And I would definitely not reduce their capacity for reasoning in any way whatsoever. There are people with aphantasia working in the arts, there are people without an internal monologue working as authors - how "internal" something is can be trivially changed with no influence on either the architecture or the capabilities of that architecture.

Reasoning itself isn't some unified process, nor is it infinite iteration. It requires specific understanding of the domain being reasoned over, especially understanding of which transformation rules are applicable to produce desired states, where the judgement about which states are desirable has to be learned itself. LLMs can reason today, they're just not as good at it as humans are in some domains.

HarHarVeryFunny 655 days ago

Sure - a transformer can iterate endlessly by generating tokens, but this is no substitute for iterating internally and maintaining internal context and goal-based attention.

One reason why just blathering on endlessly isn't the same as thinking deeply before answering, is that it's almost impossible to maintain long-term context/attention. Try it. "Think step by step" or other attempts to prompt the model into generating a longer reply that builds upon itself, will only get you so far because keeping a 1-dimensional context is no substitute for the thousands of connections we have in our brain between neurons, and the richness of context we're therefore able to maintain while thinking.

The reasoning weakness of LLMs isn't limited to "some domains" that they had less training data for - it's a fundamental architecturally-based limitation. This becomes obvious when you see the failure modes on simple problems of the "how few trips does the farmer need to cross the river with his chicken & corn, etc" type. You don't need to morph the problem to require out-of-distribution knowledge to get it to fail - small changes to the problem statement can make the model state that crossing the river backwards and forwards multiple times without loading/unloading anything is the optimal way to cross the river.

But, hey, no need to believe me, some random internet dude. People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.

famouswaffles 654 days ago

>You don't need to morph the problem to require out-of-distribution knowledge to get it to fail

Make the slight variation look different from the version it has memorized and it often passes. Sometimes it's as straightforward as just changing the names. Humans have this failure mode too.

accountnum 655 days ago

> One reason why just blathering on endlessly...

First of all, I would urge you to stop arbitrarily using negative words to make an argument. Saying that LLMs are "blathering" is equivalent to saying you and I are "smacking meat onto plastic to communicate" - it's completely empty of any meaning. This "vibes based arguing" is common in these discussions and a massive waste of time.

Now, I don't really understand what you mean by "almost impossible to maintain long-term context/attention". I'm writing fiction in my spare time, LLMs do very well on this by my testing, even subtle and complex simulations of environments, including keeping track of multiple "off-screen" dynamics like a pot boiling over.

There is nothing "1-dimensional" about the context, unless you mean that it is directional in time, which any human thought is as well, of course. As I said in my original reply, each token is represented by a multidimensional embedding, and even that is abstracted away by the time inference reaches the later layers. The word "citrus" isn't just a word for the LLM, just as it isn't just a word for you. Its internal representation retrieves all the contextual understanding that is related to it. Properties, associated feelings, usage - every relevant abstract concept is considered. And these concepts interact with every embedding of every other token in the input in a learned way, and with the position they have relative to each other. And then when an output is generated from that dynamic, said output influences the dynamic in a way that is just as multidimensional.

The model can maintain context as rich as it wants, and it can build upon that context in whatever way it wants as well. The problem is that in some domains, it didn't get enough training time to build robust transformation rules, leading it to draw false conclusions.

You should reflect on why you are only able to provide vague and underdefined, often incorrect, arguments here. You're drawing distinctions that don't really exist and trying to hide that by appealing to false intuitions.

> The reasoning weakness... it's a fundamental architecturally-based limitation...

You have provided no evidence or reasoning for that conclusion. The river crossing puzzle is exactly what I had in mind when talking about specific domains. It is a common trick question with little to no variation and LLMs have overfit on that specific form of the problem. Translate it to any other version - say transferring potatoes from one pot to the next, or even a mathematical description of sets being modified - and the models do just fine. This is like tricking a human with the "As I was going to Saint Ives" question, exploiting their expectation of having to do arithmetic because it looks superficially like a math problem, and then concluding that they are fundamentally unable to reason.

> People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.

What weakness? That current LLMs aren't as good as humans when reasoning over certain domains? I don't follow him personally but I doubt he would have the confidence to make any claims about fundamental inabilities of the transformer architecture. And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.

HarHarVeryFunny 655 days ago

> And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.

I disagree - there is pretty widespread agreement that reasoning is a weakness, even among the best models, (and note Chollet's $1M ARC prize competition to spur improvements), but the big labs all seem to think that post-training can fix it. To me this is whack-a-mole wishful thinking (reminds me of CYC - just add more rules!). At least one of your "Turing award laureates" thinks Transformers are a complete dead end as far as AGI goes.

We'll see soon enough who's right.

accountnum 655 days ago

A weakness of the current models in some domains considered useful, yes - but not a fundamental limitation of the architecture. I see no consensus on the latter whatsoever.

The ARC challenge tests spatial reasoning, something we humans are obviously quite good at, given 4 billion years of evolutionary optimization. But as I said, there is no "general reasoning", it's all domain dependent. A child does better at the spatial problems in ARC given that it has that previously mentioned evolutionary advantage, but just as we don't worship calculators as superior intelligences because they can multiply 10^9 digit numbers in milliseconds, we shouldn't draw fundamental conclusions from humans doing well at a problem that they are in many ways built to solve. If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.

The bet of current labs is synthetic data in pre-training, or slight changes to natural data that induce more generalization pressure for multi-step transformations on state in various domains. The goal is to change the data so models learn these transformations more readily and develop good heuristics for them, so not the non-continuous patching that you suggest.

But yes, the next generation of models will probably reveal much more about where we're headed.

HarHarVeryFunny 656 days ago

You are confusing number of sequential steps with total amount of compute spent.

The input sequence is processed in parallel, regardless of length, so number of tokens has no impact on number of sequential compute steps which is always N=layers.
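
A rough way to see the distinction for a single forward pass is the cost model sketched below. It is an assumption-laden simplification, not a benchmark: `single_pass_cost` is a hypothetical helper, it counts one "layer evaluation" per token per layer, and it ignores the extra cost of attention across tokens.

```python
def single_pass_cost(num_tokens, num_layers):
    """Toy cost model for ONE forward pass over a sequence (assumed, simplified)."""
    sequential_steps = num_layers  # all positions are processed in parallel per layer
    total_layer_evals = num_layers * num_tokens  # total work still grows with length
    return sequential_steps, total_layer_evals


for n_tokens in (10, 1000):
    seq, total = single_pass_cost(n_tokens, num_layers=32)
    print(n_tokens, seq, total)  # sequential depth stays 32; total work scales with tokens
```

In this sketch a longer input raises the total work but leaves the sequential depth of that one pass at N.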

> Do you know what a "language model" is capable of in the limit?

Well, yeah, if the language model is an N-layer transformer ...

famouswaffles 656 days ago

Fair Enough.

Then increase N (N is almost always increased when a model is scaled up) and train or write things down and continue.

A limitless iteration machine (without external aid) is currently an idea of fiction. Brains can't do it so I'm not particularly worried if machines can't either.

HarHarVeryFunny 656 days ago

Increasing the number of layers isn't a smart way to solve it. In order to be able to reason effectively and efficiently the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".

This lack of "variable compute" is a widely recognized shortcoming of transformer-based LLMs, and there are plenty of others. The point apropos this thread is that you can't just train an LLM to be something that it is not. If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process... the best it can do is to model the outcome of the generative process, not the process itself. The difference is that without having learnt the generative process, the model will fail when presented with a novel input that it didn't see during training, and therefore didn't memorize the "cheat sheet" answer for.
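
As a toy illustration of the fixed-versus-variable compute distinction (and only that; it says nothing about what a trained LLM can or cannot learn), compare a process that runs until it halts with the same process cut off after a fixed budget of N steps. The Collatz iteration is used here purely as a convenient example of a process whose step count depends on the input; `steps_until_done` and `steps_with_budget` are hypothetical names.

```python
def collatz_step(n):
    """One step of the Collatz iteration."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def steps_until_done(n):
    """Variable compute: keep iterating until the process halts at 1."""
    steps = 0
    while n != 1:
        n = collatz_step(n)
        steps += 1
    return steps


def steps_with_budget(n, budget):
    """Fixed compute: at most `budget` steps; None if the process did not finish."""
    for steps in range(budget + 1):
        if n == 1:
            return steps
        n = collatz_step(n)
    return None


print(steps_until_done(6))       # 8 steps: 6, 3, 10, 5, 16, 8, 4, 2, 1
print(steps_with_budget(6, 10))  # 8, fits inside the budget
print(steps_with_budget(27, 10)) # None, this input needs more than 10 steps
```

With a fixed budget, easy inputs waste the unused steps and hard inputs simply do not finish, which is the shape of the complaint being made here.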

famouswaffles 656 days ago

>Increasing number of layers isn't a smart way to solve it.

The "smart way" is a luxury. Solving the problem is what matters. Think of a smart way later if you can. That's how a lot of technological advancement has worked.

>In order to be able to reason effectively and efficiently the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".

Same thing. Efficiency is nice but a secondary concern.

>If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process.

Every inference problem itself needs a fixed number of compute steps (yes, even your chess move). Variability is nice to have between inferences (maybe move 1 requires 500 steps but move 2 only 240, etc.): a nice thing, but never a necessary one.

3.5-turbo-instruct plays chess consistently at 1800 Elo so clearly the N of the current SOTA is already enough to play non-trivial chess at a level beyond most humans.

There is an N large enough for every GI problem humans care about. Not to sound like a broken record but once again, limited ≠ trivial.