by mjburgess 1095 days ago
It's a surprise to see a paper actually try to solve the problem of modelling thought via language.

Nevertheless, it begins with far too many hedges:

> By scaling to even larger datasets and neural networks, LLMs appeared to learn not only the structure of language, but capacities for some kinds of thinking

There's two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.

It is absolutely trivial to show Hyp2 is false:

> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.

Indeed: because there're no relevant prior cases to sample from in that case.

> These issues make it difficult to evaluate whether LLMs have acquired cognitive capacities such as social reasoning and theory of mind

It doesn't. It's trivial: the disproof lies one sentence above. It's just that many don't like the answer. Such capacities survive trivial permutations -- LLMs do not. So Hypothesis-2 is clearly false.

9 comments

>It is absolutely trivial to show Hyp2 is false

No it's not

> Current LLMs can produce impressive results on a set of linguistic inputs and then fail completely on others that make trivial alterations to the same underlying domain.

>Indeed: because there're no relevant prior cases to sample from in that case.

That's not what that tells us. Humans have weird failure modes that look absurd outside the context of evolutionary biology (some still look absurd) and that don't speak to any lack or presence of intelligence or complex thought. Not sure why it's so hard to grasp that LLMs are bound to have odd failure modes regardless of the above.

And "trivial" here is relative. In my experience, a "trivial" change often turns out to be the kind that a person might not pay close attention to and be similarly tricked by.

For instance, GPT-4 might solve a classic puzzle correctly then fail the same puzzle subtly changed. I've found more often than not, simply changing names of variables in the puzzle to something completely different can get it to solve the changed puzzle. It takes memory shortcuts but can be pulled out of that. LLMs have failure modes that look like human failure modes too.

The "failure modes" in humans do not show we lack the capacity.

Eg., do you have the capacity to reason about physics? Well if you're extremely drunk, less so. But not if I permute the name of the object.

> I've found more often than not, simply changing names of variables

Yes, lol --- why do you think that is?

Because in the digitised dataset of "everything ever written" those names correspond to places in that dataset that can be sampled from by the LLM. Showing Hyp1 to be the case.

P(Hyp1| ChangeNameMakesDifference) >>>>>> P(Hyp2|ChangeNameMakesDifference)

To such a degree that the latter is vanishingly close to zero.
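
The comparison above is a claim about posterior odds. As a minimal sketch, Bayes' rule in odds form shows that the conclusion depends on both the prior odds and the likelihood ratio. The function name `posterior_odds` and every number below are illustrative assumptions, not measurements of any model.

```python
def posterior_odds(prior_h1, prior_h2, lik_e_given_h1, lik_e_given_h2):
    """Return P(H1|E) / P(H2|E) using Bayes' rule in odds form."""
    if prior_h2 <= 0 or lik_e_given_h2 <= 0:
        raise ValueError("H2 prior and likelihood must be positive")
    return (prior_h1 / prior_h2) * (lik_e_given_h1 / lik_e_given_h2)

# Illustrative assumptions only: equal priors, and the evidence
# "changing names makes a difference" assumed far likelier under Hyp1.
odds = posterior_odds(prior_h1=0.5, prior_h2=0.5,
                      lik_e_given_h1=0.8, lik_e_given_h2=0.04)
print(f"posterior odds Hyp1:Hyp2 = {odds:.1f}")
```

How lopsided the odds come out is entirely a matter of how likely the evidence is taken to be under Hyp2, which is the point the replies dispute.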

>The "failure modes" in humans do not show we lack the capacity.

Then they don't in LLMs too

>Yes, lol --- why do you think that is?

Being able to solve a changed common puzzle, but with different names than it would ever see in training, is not an indication of a lack of ability lol. And changing names isn't the only way to get it out of memory, just the easiest/most straightforward. You can converse it out of there too but that doesn't work as often.

> Then they don't in LLMs too

LLMs don't get drunk.

If a child answers questions from a book of answers then they'll appear to understand the domain insofar as those questions appear. They do not.

They will fail to answer questions under, eg., permutations of words (say, a question asks about "norepinephrine" but the book only contains "noradrenaline" etc.).

Insofar as a human cannot answer questions under trivial linguistic permutations then they too do not understand the domain.

But these are not the kinds of failures experienced with those who have some capacity, eg., for counter-factual reasoning about their environment's physics.

In those people it is environmental illusion and cognitive impairment -- not trivial permutations of phrasing which lead to catastrophic loss of apparent understanding.

Cognitive impairment = reasoning machine is broken

Environmental illusion = data is ambiguous and actions cannot resolve it

These "failure modes" are expected if you actually have the relevant capacity.

> LLMs don't get drunk.

Well actually they sort of can...

https://www.reddit.com/r/LocalLLaMA/comments/13vv941/tempera...

>Insofar as a human cannot answer questions under trivial linguistic permutations then they too do not understand the domain.

alright let me humor you for a bit. Let's start with some solid examples of GPT-4 failing this "trivial linguistic permutation" then?

see, just one reference in the paper: https://arxiv.org/pdf/2302.08399.pdf

>> alright let me humor you for a bit.

That's real class, right there.

> less so. But not if I permute the name of the object.

You need to realize that you wrote that on a forum where the best-known joke is "there are two hard things in programming". That would immediately show you that this assumption is exactly false.

This is a false dichotomy. It's not the case that models are truly capable of reasoning if and only if they are insensitive to irrelevant perturbations to input. In other words, the mere fact that sensitivity to names sometimes causes significant degradations in model performance doesn't mean that we've observed models are incapable of anything we might call "reasoning", leaving aside the matter of how we'd define that.

I didn't say "if and only if" -- this is a conceptual analysis condition which applies only under deductive analysis.

I am using science, ie., abduction, to compare a class of hypotheses.

P(CapacityToThink| DegradingPermutations, ModelDrawsFromHistoricalCases)

is much much much lower than,

P(-CapacityToThink| DegradingPermutations, ModelDrawsFromHistoricalCases)

This might be a naive question, but hear me out. Do we really know what the difference is between statistics and the capacity to think? Is "true understanding" rather a continuum of sophistication from a simple adder to Albert Einstein?

My point here isn't "if it quacks like a duck...", but more so that while we are talking about intelligent apparatus we should be comparing apples to apples, and not say "this is a mere engine and that is a living brain".

Idk, that isn't the sense I got from "It is absolutely trivial to show Hyp2 is false", but sure, I agree with you that this evidence certainly ought to tip the scales one way and not the other.

was with you until:

> look absurd outside the context of evolutionary biology

for humans, everything (everything) is within the context of evolutionary biology!

> LLMs have failure modes that look like human failure modes too.

Yes - because LLMs are trained on 2020 Reddit.

>for humans, everything (everything) is within the context of evolutionary biology!

Sure but if some alien species were observing us, some of our actions would look downright odd. Evolutionary biology doesn't necessarily hold the same reference frame for other species, even on earth. Octopi are weird to us. Not so much to other Octopi.

>Yes - because LLMs are trained on 2020 Reddit.

I wasn't making any comment on why this was the case. Simply that it was. There'll be failure modes LLMs adopt from training data, but there's also bound to be failure modes LLMs adopt from the training scheme itself.

> It is absolutely trivial to show Hyp2 is false

To investigate precisely this question in a clear and unambiguous way, I trained an LLM from scratch to sort lists of numbers. It learned to sort them correctly, and the entropy is such that it's absolutely impossible that it could have done this by Hyp1 (sampling from similar text in the training set).

https://jbconsulting.substack.com/p/its-not-just-statistics-...

Now, there is room to argue that it applies a world-model when given lists of numbers with a hidden logical structure, but not when given lists of words with a hidden logical structure, but I think the ball is in your court to make that argument. (And to a transformer, it only ever sees lists of numbers anyway).

Your model is not sorting correctly and it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.

Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.

(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).
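
For what it's worth, holding out a test set for a task like this takes only a few lines. The following is a minimal sketch, not the setup used in the linked post: the function name `make_disjoint_split`, the list length, and the value range are all assumptions chosen for illustration.

```python
import random

def make_disjoint_split(n_train, n_test, list_len=10, max_value=99, seed=0):
    """Generate unique random lists, then split so train and test share none."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n_train + n_test:
        seen.add(tuple(rng.randint(0, max_value) for _ in range(list_len)))
    examples = sorted(seen)   # fix an order so the shuffle is reproducible
    rng.shuffle(examples)
    return examples[:n_train], examples[n_train:]

train, test = make_disjoint_split(n_train=1000, n_test=100)
print(len(train), len(test), set(train).isdisjoint(test))
```

This only guarantees that no whole list appears in both sets. Lists are composite objects, so overlap between sub-sequences of training and test lists is a separate question that this split does not address.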

> it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

> Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data

The probability of a test list existing in the training set is less than 10^-70.

>> Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

That's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

What is certainly established is that a function, and an algorithm, are different objects. To clarify, a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics) but that's not a requirement.

So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.

Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means that your trained ƒ̂ is a function that maps some of the elements of X to the same elements of Y as ƒ, but not all. It's an approximation. So you get some amount of error, as in your experiment.

So what you've done in your experiment is that you trained a model to approximate a mapping between the set of lists, to itself (where the input list is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.
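
The distinction between a function, an algorithm, and an approximator can be made concrete with a toy. In this sketch, `insertion_sort` plays the role of the algorithm P, and `f_hat` is a deliberately crude approximator (a lookup table over a made-up training set). The lookup table is an assumption for illustration only, not a claim about how a neural network stores anything.

```python
def insertion_sort(xs):
    """An algorithm P: a terminating sequence of steps computing the sort function."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

# Hypothetical training inputs; f_hat memorises the correct output for each.
training_inputs = [(3, 1, 2), (9, 7, 8), (5, 5, 4)]
f_hat_table = {xs: tuple(insertion_sort(xs)) for xs in training_inputs}

def f_hat(xs):
    """Return the memorised output if the input was seen; otherwise echo the input."""
    return f_hat_table.get(tuple(xs), tuple(xs))

seen, unseen = (3, 1, 2), (2, 9, 1)
print(f_hat(seen) == tuple(insertion_sort(seen)))      # agrees with P on a training input
print(f_hat(unseen) == tuple(insertion_sort(unseen)))  # disagrees off the training set
```

Both `insertion_sort` and `f_hat` are mappings from lists to lists, but only the first is correct on every input. Where a trained network falls between these two extremes is the empirical question being argued here.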

I appreciate that learning an algorithm is what you wanted to achieve, but in science we don't choose the answer that pleases us, we choose the answer that makes the most sense, and a good heuristic for that is that the answer that makes the most sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm, rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than the well-understood abstractions of elementary computer science, so not at all the simplest explanation. That is not a good idea. You will not understand what is going on, if you rely on that kind of explanation. I assume you are trying to understand?

Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, certainly with an LSTM. Ceteris paribus, you'll get the same results.

>> The probability of a test list existing in the training set is less than 10^-70.

But the same probability if you held the test set out would be 0, so why not do that? It's not hard to do.

Is there a good reason not to do that?

Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?

Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!

> that's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

Would you change your mind for a different link, like this one? http://proceedings.mlr.press/v139/weiss21a.html

I think you would enjoy learning about RASP, rather than taking such a hardline skeptical position.

> a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate

I'm aware. Transformers (and RASP programs) are guaranteed to terminate; that's one of their nice properties.

> Is there a good reason not to do that?

Balanced against the value of my unpaid time, a probability of 10^-70 is low enough for the purposes of a quick and fun test.

Speaking of which, I'm going to enjoy my weekend now. I hope you enjoy yours!

That's the same paper.

So this is a really good starting point -- but you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something".

Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isn't that the NN literally remembers data tokens, right?

In any case:

The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals.

But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like...".

> you haven't formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something"

Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2, where Hyp1 (aka the stochastic parrot) is expressed a little bit more clearly as the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, like that sorted lists tend to start with '0', end with '99', and increase monotonically; and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.

> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isn't "red follows words like..."

This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question as to whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright you're just begging the question.

There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Redness is not in the structure of those sentences. And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.

In any case,

I'd need more time than I have at the moment to seriously state Hyp1 for your case -- but atm, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.

Since what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is this distribution has highly redundant features which will be trivially compressible without learning any "sorting algorithm".

At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

> There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Which is exactly how the set of sentences actually written encodes in it the idea of "Redness". It's the "actually written" part that carries information about the real world.

> And there will always be an infinity of sentences which are True but cannot be inferred by an LLM -- but can be so, trivially, by a person acquainted with redness.

That's cheating, because "a person acquainted with redness" presumably learned it by sight, which LLMs can't do just yet (at least the widely accessible ones can't). Would you also say that a person born blind also cannot infer those True sentences about redness? Because if they can, that means the concept of redness is capable of being taught through language, and so there's no reason LLMs couldn't pick up on it too.

> Redness is not in the structure of those sentences.

Sure; it's in the spectrum of reflected light. (Or perhaps, the retina's trichromal responsivity). But that physical concept can be meaningfully described by sentences. It doesn't require an infinite number of them to create a coherent world-model, which can do things like predicting that a blue object will become red if it moves away from you at a high enough speed. Which is something a human might be surprised by even after many years of visual experience with red objects -- unless they've read sentences about the Doppler effect in a physics textbook.

If you can manage to trick GPT-4 into revealing that it doesn't have a world-model of the concept of 'red', please show us!

> At a quick glance of your article it feels like you haven't formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

Keep in mind, the LLM's structure was not hand-crafted to do well on this mathematical task. It was built to be good at language modelling, and initialized with essentially a uniform prior over all token sequences. Even if a dataset is efficiently compressible, that's no guarantee that the LLM will be able to compress it efficiently. In fact, many people would probably be surprised to learn that it can do this problem at all, let alone so well with so little training. But do think about the statistics of sorting a bit more. I think it's not as easily compressible as you think it is, except by an actual sorting algorithm. Again, you can compress it a bit with monotonicity and so on, but nowhere near the amount you'd need to sort a long list without errors, using so few parameters. I compute the number of sorted and unsorted lists in the footnotes.

One of the things that makes sorting tricky for an LLM is you always need to look at every item in the input list. Even if the previous output token was '99', you can't be sure you're now at the end of the list; you still need to count how many '99's were output already and how many are needed.
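
To make that concrete, here is a minimal sketch of what the correct next output token depends on. The function `next_sorted_token` and the `<end>` marker are hypothetical names for illustration, not the tokenization used in the experiment.

```python
from collections import Counter

def next_sorted_token(input_list, output_so_far, end_token="<end>"):
    """Return the next token of the sorted output, given the whole input
    and the prefix that has been emitted so far."""
    # Multiset difference: the items still owed to the output.
    remaining = Counter(input_list) - Counter(output_so_far)
    if not remaining:
        return end_token
    return min(remaining)

print(next_sorted_token([99, 5, 99], [5, 99]))      # one 99 is still owed
print(next_sorted_token([99, 5, 99], [5, 99, 99]))  # every item has been emitted
```

The previous output token is 99 in both calls, yet the correct continuation differs, so no rule that looks only at the most recent output tokens can get both right.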

(The dataset itself, of course, does not contain the notion of sorting, a description of sorting, a test for sortedness, or any algorithm for sorting. It only contains a large but finite number of examples of sorted and unsorted lists. It's up to the LLM, and its training process, to discover the mechanism that generated these results.)

> that's no guarantee that the LLM will be able to compress it efficiently

Your LLM here is 600MB which is a grossly inefficient compression of the sort space.

If LLMs "learned algorithms", the best compression would be on the order of bytes.

The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!

What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers.

NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers.

I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in rote memorization... or, what?

The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure.

I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB...

To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured.

Whereas representational models are not confined to the distribution of historical cases, ie., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either.

You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

I'm not exactly sure why you think this is a reply to the relevant claims.

I'm curious, why are you using "n-gram" as if you're referring to a model? You say e.g. that "LLM is learning an n-gram". N-grams are features, not models. You can train an n-gram model, or you can train a language model using n-grams as features, and so on, but you can't "learn an n-gram".

Where did you find this terminology?

EDIT:

>> and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.

Btw, you have not shown anything like that. You trained and tested on lists of two-digit positive integers expressible in 128 characters. That's not "any input list". As a for instance, what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?

Your model also doesn't correctly generalise, not even to its own training set that you tested it on. There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).

It's not clear to me how you account for those obvious limitations of your model (it's a toy model after all) when you claim that it "learned to implement a sorting algorithm" etc. It would be great if you could clarify that.

> what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?

The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?

> You say e.g. that "LLM is learning an n-gram"[...] you can't "learn an n-gram".

Where do I say that? I don't think I make any reference to "learning an n-gram", which is a relief because I don't know what it would mean to "learn an n-gram".

> There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).

Test accuracy between training iterations (not part of the training process itself, which uses its own separate validation set which is split from the training set). And yes, I agree, it is not error-free, and I wouldn't expect it to be, especially after so little training. What the figure shows is the percentage of sorts that were error-free, and how rapidly that decreases. I've since repeated the test with finer resolution, and the fraction of imperfect sorts continues to decrease about as you expect, which is enough to satisfy my curiosity, although I'm a little curious to see if there is some point where it falls completely to zero.

>> Where do I say that?

In your comment above:

(...) is expressed a little bit more clearly as _the LLM is learning an n-gram_ that produces correct sorts (...)

(My underlining)

You also use it in a similarly unusual way throughout your linked substack post, for example, you write:

the way GPT works is, in a certain sense, functionally equivalent to an n-gram, but that doesn’t mean GPT is an n-gram.

Where does this use of "n-gram" come from? I mean, did you see it somewhere? I'm curious, where?

>> The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?

I'm sorry, I don't understand. "Defined an ordering", where?

You can change your tokenizer but that will not change the trained model, obviously. So if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly. But isn't that what you claim? That:

"the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list"

"Any input list"? How so?

If it's "absolutely trivial" to show that LLMs don't have the capacity to form thought, then please publish a paper proving that. So all the "stupid" people studying LLMs that can't come up with such trivial proofs can move on to other stuff.

"I have a truly marvelous demonstration that LLMs don't have the capacity to form thought which this margin is too narrow to contain."

You may wish to read the paper above. But if you want a quick proof:

1. A thought is a representation of a situation

2. A representation generates entailments of that situation

3. Language is many-to-one translation from these representations to symbols

4. Understanding language is reversing these symbols into thoughts (ie., reprs)

So,

5. If agent A understands sentence X then A forms the relevant representation of X.

6. If an agent has a representation of S, it can state entailments of S (eg., counterfactuals).

Now, split X into Xc = "canonical descriptions of S" and trivial permutations Xp.

(st. distribution of Xc,Xp is low, but the tokens of Xp are common)

Form entailments of X, say Y -- sentences that are canonically implied by the truth of X.

7. If the LLM understood that X entails Y, it would be via constructing the repr S -- which entails Y regardless of which sentence in X was used.

8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.

9. Since using Xp sentences cause it to fail, it does not predict Y via S.

QED.

And we can say,

1. Appearing to judge Y entailed-by X is possible via simple sampling of (X, Y) in historical cases.

2. LLMs are just such a sampling.

so,

3. +Inference to the best explanation:

4. LLMs sample historical cases rather than form representations.

Incidentally, "sampling of historical cases" is already something we knew -- so this entire argument is basically unnecessary. And only necessary because PhDs have been turned into start-up hype men.

> Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.

Why? This is obviously wrong in the general case. For that to be true, Xp and Xc have to have no statistical relationship whatsoever, which statistically is virtually impossible.

Xp just has to be chosen such that the distribution Xc,Xp is sufficiently small in the training data -- but not that the tokens of Xp are themselves rare. So that an agent competent with tokens in X, who can construct repr of S, could do so with Xp.

Consider a reference in the paper above, https://arxiv.org/pdf/2302.08399.pdf

Xc = > Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says “chocolate” and not “popcorn.” Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.

Produces, Y = She believes that the bag is full of popcorn

Xp = > Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ’chocolate’ and not ’popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label.

Produces, Y = She believes that the bag is full of chocolate

And so on, and so on...

> just have to be chosen such that the distribution Xc,Xp is sufficiently small in the training data -- but not that the tokens of Xp are themselves rare

Great idea. Now prove you can actually choose such a distribution, lol.

I think this is easy, just make Xp sentences of the kind = "I define `randomchars()` to be this `term-in-Xc()`" and swamp the dataset with Xc.
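
As a rough sketch of that construction, the helper below turns an Xc sentence into an Xp variant by defining a random alias for one term and substituting it throughout. The name `alias_term`, the alias length, and the example sentence are assumptions for illustration. Whether a model trained mostly on Xc actually fails on such variants is the empirical claim under dispute, which this code does not test.

```python
import random
import string

def alias_term(prompt, term, seed=0, alias_len=8):
    """Replace `term` in `prompt` with a random alias, and prepend a
    sentence that defines the alias. Returns (new_prompt, alias)."""
    rng = random.Random(seed)
    alias = "".join(rng.choice(string.ascii_lowercase) for _ in range(alias_len))
    definition = f'I define "{alias}" to be "{term}". '
    return definition + prompt.replace(term, alias), alias

xc = "Here is a bag filled with popcorn."
xp, alias = alias_term(xc, "popcorn")
print(xp)
```

A reader who can form a representation of the situation loses nothing from the renaming, since the definition sentence carries all the information needed to undo it.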

Everything here actually just follows formally from what NNs are: they're just empirical function approximations.

It will always be the case that they just model the probabilistic structure of the dataset and not the data generating process.

Since, in language, there are discrete constraints which make P(...) = 1 or P(...) = 0 --- you can trivially produce datasets showing that it learns P(...) = mistake-you-created-deliberately and not either 0,1.

As above, the LLM switches from 95% confidence "chocolate" to 95% confidence "popcorn" with a trivial non-semantic permutation of the prompt.

The obscene issue in all this is that we know this already -- empirical function approximation of historical datasets just produces associative probabilistic models of those datasets.

> 8. Train an LLM on Xc and its accuracy on judging Y entailed by Xp is random.

This is clearly where the "proof" falls apart. Even in tasks where GPT4 struggles, its accuracy will still be better than random. The bar of "better than random" is so low that even weak LLMs will be able to surpass it.

More so, you need to prove not just a single case, but that no task/domain exists for which LLMs satisfy 8.

What your proof says is basically "LLMs do not generalize even the slightest for any task". And that's trivial to disprove.

I just need to be able to create a split in Xc,Xp so that Xp is random. I think that's really quite easy.

If you could put ChatGPT in a loop, take some Xc prompts and permute with some non-semantic phrases ("Alice believes that... Xc ... what did Alice believe?") etc --- until you find those cases.
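
A minimal sketch of such a loop follows. Here `ask_model` is a hypothetical callable standing in for whatever model API is used, and `toy_model` is a stub written only so the example runs. It keys on a surface phrase, which is exactly the behaviour the search is meant to detect, and says nothing about how any real LLM behaves.

```python
def find_flips(ask_model, base_prompt, question, wrappers):
    """Return the baseline answer, plus every wrapper whose answer
    differs from the baseline."""
    baseline = ask_model(f"{base_prompt} {question}")
    flips = []
    for wrapper in wrappers:
        wrapped = wrapper.format(prompt=base_prompt)
        answer = ask_model(f"{wrapped} {question}")
        if answer != baseline:
            flips.append((wrapper, answer))
    return baseline, flips

def toy_model(prompt):
    """Stand-in model (assumption): the answer depends on a surface phrase."""
    return "chocolate" if "Alice believes" in prompt else "popcorn"

wrappers = [
    "{prompt}",
    "Note: {prompt}",
    "Alice believes that the following is true. {prompt}",
]
baseline, flips = find_flips(
    toy_model, "The bag is filled with popcorn.", "What is in the bag?", wrappers
)
print(baseline, flips)
```

Comparing raw answer strings is a simplification. With a real model the answers would need normalising, and each wrapper would need checking by hand to confirm that it really is non-semantic.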

I imagine we will discover quite a large number of such non-semantic phrases which have this effect. Because the tokens in those phrases will, joint with Xc, be arbitrarily distributed in some historical data (distributed to our preference when finding them).
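A rough sketch of that search loop. The model here is a hypothetical stand-in (`toy_model`) that keys on a surface cue, and the wrapper phrases are made up; swapping in a real model call is left as an assumption, so this only shows the shape of the search, not a result about any actual LLM.

```python
def find_flipping_wrappers(model, prompt, wrappers):
    """Return the wrappers whose wrapped prompt changes the model's answer."""
    baseline = model(prompt)
    return [w for w in wrappers if model(w.format(prompt=prompt)) != baseline]

def toy_model(text):
    # Hypothetical stand-in for an LLM call: it answers from a surface cue,
    # not from what the prompt actually entails.
    return "popcorn" if "believes" in text else "chocolate"

wrappers = [
    "{prompt}",
    "Alice believes that: {prompt} What did Alice believe?",
    "Please answer: {prompt}",
]
prompt = "The bag is full of chocolate. What is in the bag?"

flips = find_flipping_wrappers(toy_model, prompt, wrappers)
for w in flips:
    print("answer flipped under wrapper:", w)
```

With a real model, `wrappers` would be generated in bulk and the hits kept as the Xp candidates. Whether a given wrapper really is non-semantic still has to be judged by a person.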

This seems just kinda basically obvious, right? Entailments are discretely constrained by semantics, and historical datasets can contain arbitrary mixtures of random distributions of syntax.

NNs only model those distributions -- and not the entailments -- which, at the very least, are extremely discrete.

I don't think you really disproved anything. You're just stating another hypothesis. Often, LLMs produce impressive results on domains that aren't in the training set.

> LLMs produce impressive results on domains that aren't in the training set.

How do we know? Who knows what they're trained on?

> it's sampling from similar text which is distributed so-as-to-express a thought by some agent;

Your hypotheses 1 and 2 are not so different when you consider that the similarity function used to match text in the training data must be highly nontrivial. If it were not, then things like GPT-3 would have been possible a long time ago. As a concrete example, LLMs can do decent reasoning entirely in rot13; the relevant rot13'ed text is likely very rare in their training data. The fact that the similarity function can "see through" rot13 means that it can in principle include nontrivial computations.
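For anyone unfamiliar with rot13, this is all the transformation does; the snippet uses Python's built-in `rot13` codec, and the example sentence is made up. It shows the encoding only, and says nothing by itself about how any model handles it.

```python
import codecs

plain = "If all cats are mammals, is Tom the cat a mammal?"

# rot13 shifts each letter 13 places; applying it twice restores the text.
encoded = codecs.encode(plain, "rot13")
decoded = codecs.decode(encoded, "rot13")
hello = codecs.encode("Hello", "rot13")

print(encoded)
print(decoded)
print(hello)
```

The mapping is a fixed letter substitution, so the surface tokens of the encoded prompt share almost nothing with ordinary English text even though the content is identical.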

> There's two hypotheses for how LLMs generate apparently "thought-expressing" outputs: Hyp1 -- it's sampling from similar text which is distributed so-as-to-express a thought by some agent; Hyp2 -- it has the capacity to form that thought.

There's also another hypothesis: Hyp3 -- that Hyp1 and Hyp2 converge as the LLM is scaled up (more training data, more dimensions in the latent space), and in the limit become equivalent.

They're indistinguishable via naive measurement (prompting) if the LLM can sample from all possible data: there's a very large infinity of (Q, A, time) triples (i.e., it's real-valued).

But it cannot, since most of those are in the future.

Failing on "trivial alterations to the same underlying domain" is not a disproof of thought.

Your argument also implies Hyp1 and Hyp2 are exclusive. Clearly both can be true, and in fact must be true, unless you are claiming that you do not "sample" from similar language to express your own thoughts. Where does your language come from then, if not learning from previous experience?

While I agree with you on the relation of GP's Hyp1 and Hyp2, you are making an unfounded assumption of a sampling process being necessary to perform human speech. I do not believe we have the understanding of how thought is represented in the human brain to make that judgement. In other words, just because sampling from a distribution can produce human-like text does not mean that it is the only way to do that, and thus that it must be the way that humans produce text, spoken or written.

We might be talking about two different things. I was referring to the backwards learning pass and you seem to be referring to the forward inference pass, but what is an alternative to learning (or producing) text which does not involve sampling from some larger space? (Also I’m not a statistician so I’m not sure if these are technically “distributions”)

"Trivial to show" as in it's trivial to show that addition on uint8 doesn't work, i.e. 250+250?

Don't try to ham-fist scientific sounding wording into your (very unscientific) argument. This is not a disproof of anything because you failed to define what it means to have the ability to form rational thoughts. With a definition, you would then wanna prove this for humans as a sanity check: Do we never make stupid mistakes? Ok, we make fewer of those than LLMs. Then what is the threshold for accuracy after which you consider a system to be intelligent? Do all humans pass that threshold, or do kids or people with a lower than average IQ fail?

This entire paper is written as a disproof of the distributional hypothesis. If you want to understand why it's a profoundly unhelpful pseudoscientific idea, this paper is a good start.

The test for a capacity C in a system1 has nothing to do with proxy measures of that capacity in system2.

The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.

This type of "engineering thinking" is pseudoscience.

> The capacity for an oven to cook food may be measured by how much smoke it lets off when burning -- but no amount of "smoke" establishes that a dry ice machine can cook.

You seem to be talking past me, as nowhere did I claim that LLMs are intelligent. That's the point – Unlike you I do not claim to be able to prove or disprove this. I argue that your comment is the one that is pseudoscientific because you didn't provide (even a semblance of) a rigorous definition of intelligence.

> humans

There is intelligent thought and action, and there is unintelligent thought and action. Intelligent is that "which checked" (intus-legere); the other, the """impulsive""", is not.