by aithrowaway1987 656 days ago
In 2022 Ilya Sutskever claimed there wasn't a distinction:

> It may look—on the surface—that we are just learning statistical correlations in text. But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.

(https://www.youtube.com/watch?v=NT9sP4mAWEg - sadly the only transcripts I could find were on AI grifter websites that shouldn't be linked to)

This is transparently false - newer LLMs appear to be great at arithmetic, but they still fail basic counting tests. Computers can memorize a bunch of symbolic times tables without the slightest bit of quantitative reasoning. Transformer networks are dramatically dumber than lizards, and multimodal LLMs based on transformers are not capable of understanding what numbers are. (And if Claude/GPT/Llama aren't capable of understanding the concept of "three," it is hard to believe they are capable of understanding anything.)

Sutskever is not actually as stupid as that quote suggests, and I am assuming he has since changed his mind... but maybe not. For a long time I thought OpenAI was pathologically dishonest and didn't consider that in many cases they aren't "lying," they are blinded by arrogance and high on their own marketing.

4 comments

> But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text

This is pretty sloppy thinking.

The neural network learns some representation of a process that COULD HAVE produced the text. (this isn't some bold assertion, it's just the literal definition of a statistical model).

There is no guarantee it is the same as the actual process. A lot of the "bow down before machine God" crowd is guilty of this same sloppy confusion.
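
To make the "could have produced" point concrete, here is a minimal sketch. Both processes and all names are hypothetical illustrations, not anything from the quoted talk: two different generators have exactly the same output distribution, so a model fit only to the outputs has no way of telling which one actually ran.

```python
from fractions import Fraction

def coin_process():
    """Exact output distribution of one fair coin flip."""
    return {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def die_parity_process():
    """Exact output distribution of: roll a fair die, emit H if even, T if odd."""
    dist = {"H": Fraction(0), "T": Fraction(0)}
    for face in range(1, 7):
        symbol = "H" if face % 2 == 0 else "T"
        dist[symbol] += Fraction(1, 6)
    return dist

# Different mechanisms, identical statistics over the emitted symbols.
same_statistics = coin_process() == die_parity_process()
print(same_statistics)
```

Anything learned from the H/T stream alone is equally consistent with the coin and with the die.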

It's not sloppy. It just doesn't matter in the limit of training.

1. An Octopus and a Raven have wildly different brains. Both are intelligent. So just the idea that there is some "one true system" that the NN must discover or converge on is suspect. Even basic arithmetic has numerous methods.

2. In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust. What gets the job done gets the job done. There is no way an increasingly competent predictor will not learn representations of the concepts in text, whether that looks like how humans do it or not.

> whether that looks like how humans do it or not.

So you agree with me that there is no guarantee it learns any representation of the actual process that produced the training data.

Sure, I agree. But if that's what you're getting hung up on, I think you've missed his point entirely.

Whether the machine becomes a human brain clone or something entirely alien is irrelevant. The point is, you can't cheat reality. Statistics is not magic. You can't predict text that understands without understanding.

Sure you can, and if your predictive engine doesn't have the generality and power of the original generative one, then you have no choice.

Machine learning isn't magic - the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

In the case of an LLM trained with a predict next word loss function, what you are asking/causing the model to learn is NOT the generative process - you are asking it to learn the surface statistics of the training set, and the model will only learn what it needs to (and is able to, per the model architecture being trained) in order to do this.
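
For concreteness, here is a minimal sketch of what a predict-next-word loss actually scores. It uses a bigram counter as a stand-in for the model, which is an assumption made purely for illustration and is nothing like a real LLM. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_loss(counts, tokens):
    """Average cross-entropy (in nats) of the bigram predictions on `tokens`.

    Only valid for token pairs seen during fitting; an unseen pair would
    have probability zero and log(0) is undefined.
    """
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, nxt in pairs:
        prob = counts[prev][nxt] / sum(counts[prev].values())
        total += -math.log(prob)
    return total / len(pairs)

corpus = "the cat sat on the mat".split()
model = fit_bigram(corpus)
loss = next_token_loss(model, corpus)
print(loss)
```

The loss only rewards putting probability on the token that came next; whatever internal structure helps with that is all the objective ever asks for.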

Now of course learning the surface statistics well does necessitate some level of "understanding" - are we dealing with a fairy tale or a scientific paper for example, but there is only so much the model can do. Chess is a good example, since it's easy to understand. The generative process for world class chess (whether human, or for an engine) involves way more DEPTH (cf layers) of computation than the transformer has available to model it, so the best it can do is to learn the surface statistics via much shallower pattern recognition of the state of the board. Now, given the size of these LLMs, if trained on enough games they will be able to play pretty well even using this pattern matching technique, but one doesn't need to get too far into a chess game to reach a position that has never been seen before in recorded games (e.g. watch agadmator's YouTube chess channel - he will often comment when this point has been reached), and the model therefore has no choice but to play moves that were seen in the training set in similar, but not identical positions... This is basically cargo-cult chess! It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs), but this same "cargo-cult" (follow surface statistics) generation process when out of training set applies to all inputs, not just chess...

>the model will learn what it can to minimize the error over the specific provided loss function, and no more. Change the loss function and you change what the model learns.

You clearly do not really understand what it means to predict internet-scale text with increasing accuracy. No more than that? Fantastic.

LLMs do not just learn surface statistics. So many papers have thoroughly debunked this that I'm just not going to bother. This is just straight up denial.

This has been evidently shown in chess as well. https://arxiv.org/abs/2403.15498v2

You have no idea what you are talking about. You've probably never even played with 3.5-turbo-instruct. That's how you can say this nonsense. You have your conclusion and keep working backwards to get a justification.

>It's interesting that LLMs can reach the ELO level that they do (says more about chess than about LLMs)

When you say this for everything LLMs can do then it just becomes a meaningless cope statement.

No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

Architecture matters. The best you can hope for an LLM is that training will converge on the best LLM generating process it can be, which can be great for in-distribution prediction, but lousy for novel reasoning tasks beyond the capability of the architecture.

>No amount of training would cause a fly brain to be able to do what an octopus or bird brain can, or to model their behavioral generating process.

Go back a few evolutionary steps and sure you can. Most ANN architectures basically have relatively little to no biases baked in and the Transformer might be the most blank slate we've built yet.

>No amount of training will cause a transformer to magically sprout feedback paths or internal memory, or an ability to alter its own weights, etc.

A transformer can perform any computation it likes in a forward pass and you can arbitrarily increase inference compute time with the token length. Feedback paths? Sure. Compute inefficient? Perhaps. Some extra programming around the model to facilitate this? Maybe, but the architecture certainly isn't stopping you.

Even if it couldn't, limited =/ trivial. The Human Brain is not Turing complete.

Internal memory? Did you miss the memo? Recurrence is overrated. Attention is all you need.

That said, there are already state keeping language model architectures around.

Altering weights? Can a transformer continuously train? Sure. It's not really compute efficient, but the architecture certainly doesn't prohibit it.

>Architecture matters

Compute Efficiency? Sure. What it is capable of learning? Not so much

> A transformer can perform any computation it likes in a forward pass

No it can't.

A transformer has a fixed number of layers - call it N. It performs N sequential steps of computation to derive its output.

If a computation requires > N steps, then a transformer most certainly can not perform it in a forward pass.
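
The counting argument can be sketched with a toy. This is not a transformer: `collatz_step` is a hypothetical stand-in for any inherently sequential update, and `N_LAYERS` is an arbitrary depth chosen for illustration. One pass of depth N applies at most N steps; getting further requires feeding the output back in as input.

```python
def collatz_step(n):
    """One step of the Collatz map; stands in for a sequential update."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def forward_pass(state, n_layers):
    """A fixed-depth pass: exactly n_layers sequential steps, no more."""
    for _ in range(n_layers):
        state = collatz_step(state)
    return state

def run_with_feedback(state, n_layers, n_passes):
    """Feed each pass's output back in as the next pass's input."""
    for _ in range(n_passes):
        state = forward_pass(state, n_layers)
    return state

N_LAYERS = 4
# Reaching 1 from 6 takes 8 steps: 6, 3, 10, 5, 16, 8, 4, 2, 1.
one_pass = forward_pass(6, N_LAYERS)            # stops halfway, at 16
two_passes = run_with_feedback(6, N_LAYERS, 2)  # reaches 1
print(one_pass, two_passes)
```

Whether generated tokens can serve as that feedback channel for a real model is a separate question from the single-pass limit shown here.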

FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model". Attention is not all you need if what you actually want to build is a cognitive architecture.

Transformers produce the next token by manipulating K hidden vectors per layer, one vector per preceding token. So yes, you can increase compute length arbitrarily by increasing tokens. Those tokens don't have to carry any information to work.

https://arxiv.org/abs/2310.02226

And again, human brains are clearly limited in the number of steps they can compute without writing something down. Limited =/ Trivial

>FYI, "attention is all you need" has the implicit context of "if all you want to build is a language model".

Great. Do you know what a "language model" is capable of in the limit? No.

These top research labs aren't only working on Transformers as they currently exist but it doesn't make much sense to abandon a golden goose before it has hit a wall.

How about spiders' intelligence? They don't even have a brain.

> In the limit of training on a diverse dataset (ie as val loss continues to go down), it will converge on the process (whatever that means) or a process sufficiently robust.

This is just moving the goal posts from "learning the actual process" to "any process sufficiently robust"

I didn't move anything, because last I checked the term was Artificial Intelligence, not Artificial exactly-as-a-human-does Intelligence.

A photograph is not the same as its subject, and it is not sufficient to reconstruct the subject, but it's still a representation of the subject. Even a few sketched lines are something we recognise as a representation of a physical object.

I think it's fair to call one process that can imitate a more complex one a representation of that process. Especially when in the very next sentence he describes it as a "projection", which has the mathematical sense of a representation that loses some dimensions.

> I think it's fair to call one process that can imitate a more complex one a representation of that process

I think it's sloppy.

Yes, exactly. The trick is to have enough tough data so you find the optimal one. I think as we scale models back to smaller sizes we will discover viable/correct representations.

Which basic counting tests do they still fail? Recent examples I've seen fall well within the range of innumeracy that people routinely display. I feel like a lot of people are stuck in the mindset of 10 years ago, when transformers weren't even invented yet and state-of-the-art models couldn't identify a bird, no matter how much capabilities advance.
> Recent examples I've seen fall well within the range of innumeracy that people routinely display.

Here's GPT-4 Turbo in April botching a test almost all preschoolers could solve easily: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

I have not used LLMs since 2023, when GPT-4 routinely failed almost every counting problem I could think of. I am sure the performance has improved since then, though "write an essay with 250 words" still seems unsolved.

The real problem is that LLM providers have to play a stupid game of whack-a-mole where an enormous number of trivial variations on a counting problem need to be specifically taught to the system. If the system was capable of true quantitative reasoning that wouldn't be necessary for basic problems.

There is also a deception that "chain of thought" prompting makes LLMs much better at counting. But that's cheating: if the LLM had quantitative reasoning it wouldn't need a human to indicate which problems were amenable to step-by-step thinking. (And this only works for O(n) counting problems, like "count the number of words in the sentence." CoT prompting fails to solve O(nm) counting problems like "count the number of words in this sentence which contain the letter 'e'." For this you need a more specific prompt, like "First, go step-by-step and select the words which contain 'e.' Then go step-by-step to count the selected words." It is worth emphasizing over and over that rats are not nearly this stupid; they can combine tasks to solve complex problems without a human holding their hand.)
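
For reference, the two tasks look like this as ordinary code. This is a minimal sketch with hypothetical function names: the first is one pass over n words, the second is the select-then-count decomposition, which also scans up to m letters per word.

```python
def count_words(sentence):
    """O(n): one pass over the words."""
    return len(sentence.split())

def count_words_containing(sentence, letter):
    """O(nm): select the words containing `letter`, then count the selection."""
    selected = [word for word in sentence.split() if letter in word.lower()]
    return len(selected)

example = "the quick brown fox jumps over the lazy dog"
print(count_words(example))                  # 9
print(count_words_containing(example, "e"))  # 3 ("the", "over", "the")
```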

I don't know what you mean by "10 years ago" other than a desire to make an ad hominem attack about me being "stuck." My point is that these "capabilities" don't include "understands what a number is in the same way that rats and toddlers understand what numbers are." I suspect that level of AI is decades away.

Your test does not make any sense whatsoever, because all GPT currently does when it creates an image is send a prompt to DALL-E 3.

Beyond that LLMs don't see words or letters (tokens are neither) so some counting issues are expected.

But it's not very surprising you've been giving tests that make no sense.

> Recent examples I've seen fall well within the range of innumeracy that people routinely display.

But the company name specifically says "superintelligence"

The company isn't named "as smart as the average redditor, Inc"

Right. They don't think that state-of-the-art models are already superintelligent, they're aiming to build one that is.
> newer LLMs appear to be great at arithmetic, but they still fail basic counting tests

How does the performance of today's LLMs contradict Ilya's statement?

Because they can learn a bunch of symbolic formal arithmetic without learning anything about quantity. They can learn

  5 x 3 = 15
without learning

  *****    ****     *******
  ***** =  *****  = *******
  *****    ******   *
And this generalizes to almost every sentence an LLM can regurgitate.
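
The distinction can be sketched in code, under the assumption that "symbolic" means a memorized lookup and "quantity" means actually counting items. `TIMES_TABLE`, `count_stars`, and `grids` are hypothetical names; the grids are the three star arrangements drawn above.

```python
# Memorized symbol-to-symbol fact: no notion of quantity involved.
TIMES_TABLE = {(5, 3): 15}

def count_stars(rows):
    """Count items one at a time, whatever the arrangement."""
    total = 0
    for row in rows:
        for ch in row:
            if ch == "*":
                total += 1
    return total

grids = [
    ["*****", "*****", "*****"],   # 5 + 5 + 5
    ["****", "*****", "******"],   # 4 + 5 + 6
    ["*******", "*******", "*"],   # 7 + 7 + 1
]

for grid in grids:
    print(count_stars(grid) == TIMES_TABLE[(5, 3)])
```

The lookup answers only the one question it has stored, while the counter gives the same total for all three layouts.
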
The latter can be learned from "statistical correlations in text", just like Ilya said.
Yeah, it's not clear what companies like OpenAI and Anthropic mean when they predict AGI coming out of scaled up LLMs, or even what they are really talking about when they say AGI or human-level intelligence. Do they believe that scale is all you need, or is it an unspoken assumption that they're really talking about scale plus some set of TBD architectural/training changes?!

I get the impression that they really do believe scale is all you need, other than perhaps some post-training changes to encourage longer horizon reasoning. Maybe Ilya is in this camp, although frankly it does seem a bit naive to discount all the architectural and operational shortcomings of pre-trained Transformers, or assume they can be mitigated by wrapping the base LLM in an agent that provides what's missing.