Hacker News new | ask | show | jobs
by jbay808 1096 days ago
> It is absolutely trivial to show Hyp2 is false

To investigate precisely this question in a clear and unambiguous way, I trained an LLM from scratch to sort lists of numbers. It learned to sort them correctly, and the entropy is such that it's absolutely impossible that it could have done this by Hyp1 (sampling from similar text in the training set).

https://jbconsulting.substack.com/p/its-not-just-statistics-...

Now, there is room to argue that it applies a world-model when given lists of numbers with a hidden logical structure, but not when given lists of words with a hidden logical structure, but I think the ball is in your court to make that argument. (And to a transformer, it only ever sees lists of numbers anyway).

2 comments

Your model is not sorting correctly and it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.

Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.

(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).

> it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

> Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data

The probability of a test list existing in the training set is less than 10^-70.

>> Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

That's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

What is certainly established is that a function, and an algorithm, are different objects. To clarify, a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics) but that's not a requirement.

So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.

Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means that your trained ƒ̂ is a function that maps some of the elements of X to the same elements of Y as ƒ, but not all. It's an approximation. So you get some amount of error, as in your experiment.

So what you've done in your experiment is that you trained a model to approximate a mapping between the set of lists, to itself (where the input list is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.

I appreciate that, learning an algorithm, is what you wanted to achieve, but in science we don't choose the answer that pleases us, we choose the answer that makes the most sense- and a good heuristic for that is that the answer that makes more sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm, rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than the well-understood abstractions of elementary computer science, so not at all the simplest explanation. That is not a good idea. You will not understand what is going on, if you rely on that kind of explanation. I assume you are trying to understand?

Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, certainly with an LSTM. Ceteris paribus, you'll get the same results.

>> The probability of a test list existing in the training set is less than 10^-70.

But the same probability if you held the test set out would be 0, so why not do that? It's not hard to do.

Is there a good reason not to do that?

Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?

Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!

> that's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

Would you change your mind for a different link, like this one? http://proceedings.mlr.press/v139/weiss21a.html

I think you would enjoy learning about RASP, rather than taking such a hardline skeptical position.

> a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate

I'm aware. Transformers (and RASP programs) are guaranteed to terminate; that's one of their nice properties.

> Is there a good reason not to do that?

Balanced against the value of my unpaid time, a probability of 10^-70 is low enough for the purposes of a quick and fun test.

Speaking of which, I'm going to enjoy my weekend now. I hope you enjoy yours!

That's the same paper.
EDIT: I realise I was mistaken about the OP. He is not an undergarduate student, as I initially thought. His substack profile says he is a professional engineer and consultant. So his complete cluelessness about computer science fundamentals is not the result of inexperience, and his article is nothing more than an attempt to jump on the current bandwagon of LLM hype rather than an attempt to make sense of things. I thought I was helping a CS grad learn something! What an idiot I am! Fuck. φτου γαμώ την Παναγία μου.

[Earlier text of my comment left in the interest of something or other]

It is, but it's published in the proceedings of the ICML, which means it's been peer reviewed.

The OP has checked out (I guess all this computer scienc-y stuff is boring on a weekend), but even a peer-reviewed article is not enough to cause us to let go of good, old-fashioned computer science. The article basically invents its own language and then proceeds to map transformers to it, to claim that transformers can learn various kinds of programs. It's not convincing.

In any case, learning to sort lists by neural nets is not something new, or unique to transformers, and there's pretty clear understanding of how it works. I explain why it doesn't constitute learning an algorithm in my comment above. The RASP paper doesn't change that. I mean, Recurrent Neural Nets have a known equivalence to FSMs but even they cannot learn algorithms but only approximate them. The OP wrote his article in an obvious effort to understand why GPT is "not an n-gram" even if it behaves like an n-gram model (well, it's a language model, it doesn't matter what it's trained on) so I'm guessing he can appreciate the need for clarity in explaining empirical results and he probably will want to think further on what, exactly, his experiment has shown. I hope my little comment above will help him do that.

So this is a really good starting point -- but you havent formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something".

Formally, what hypotheses are you comparing? What do you think the specific hypothesis of the "AI = stats" person is? It isnt that the NN literally remembers data tokens, right?

In any case:

The issue with forcing NNs to model mathematical features is that the structure of the data itself has those properties. So the distributional hypothesis is true for sorting ordinals.

But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isnt "red follows words like...".

> you havent formulated any hypotheses that can be tested. You've just looked at the graph and "reckoned something"

Let's not be so hasty. I think I do put it as clearly as possible. I'm comparing essentially your Hyp1 and Hyp2, where Hyp1 (aka the stochastic parrot) is expressed a little bit more clearly as the LLM is learning an n-gram that produces correct sorts through rote memorization of statistical correlations in the training data, like that sorted lists tend to start with '0', end with '99', and increase monotonically; and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.

> But it's really obviously false for natural language. The properties of the world are not the properties of word order... being red isnt "red follows words like..."

This is not really obviously false. Yes, being red isn't "red follows words like...". But a word order should still map to properties of the world, especially if those words are to be meaningful to a listener. Being red is "a surface reflects or transmits most of the light in the 600-800 nm spectrum and absorbs most of the rest". Of course, it won't do to just echo those tokens; once you've nailed down the concept of "red", you need to make sure that concepts like "reflects", "light", and "spectrum" are represented as well. It's an open question as to whether this sort of knowledge graph can be properly bootstrapped from a large volume of text descriptions, but I am strongly inclined to believe it can. If you dismiss it outright you're just begging the question.

There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Redness is not in the structure of those sentences. And there will always be an infinity of sentences which are True but cannot be infered by an LLM -- but can be so, trivially, by a person acquainted with redness.

In any case,

I'd need more time than I have at the moment to seriously state Hyp1 for your case -- but atm, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.

Since what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is this distribution has highly redundant features which will be trivially compressible without learning any "sorting algorithm".

At a quick glance of your article it feels like you havent formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

> There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Which is exactly how the set of sentences actually written encodes in it the idea of "Redness". It's the "actually written" part that carries information about the real world.

> And there will always be an infinity of sentences which are True but cannot be infered by an LLM -- but can be so, trivially, by a person acquainted with redness.

That's cheating, because "a person acquainted with redness" presumably learned it by sight, which LLMs can't do just yet (at least the widely accessible ones can't). Would you also say that a person born blind also cannot infer those True sentences about redness? Because if they can, that means the concept of redness is capable of being taught through language, and so there's no reason LLMs couldn't pick up on it too.

> Redness is not in the structure of those sentences.

Sure; it's in the spectrum of reflected light. (Or perhaps, the retina's trichromal responsivity). But that physical concept can be meaningfully described by sentences. It doesn't require an infinite number of them to create a coherent world-model, which can do things like predicting that a blue object will become red if it moves away from you at a high enough speed. Which is something a human might be surprised by even after many years of visual experience with red objects -- unless they've read sentences about the Doppler effect in a physics textbook.

If you can manage to trick GPT-4 into revealing that it doesn't have a world-model of the concept of 'red', please show us!

> At a quick glance of your article it feels like you havent formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

Keep in mind, the LLM's structure was not hand-crafted to do well on this mathematical task. It was built to be good at language modelling, and initialized with essentially a uniform prior over all token sequences. Even if a dataset is efficiently compressible, that's no guarantee that the LLM will be able to compress it efficiently. In fact, many people would probably be surprised to learn that it can do this problem at all, let alone so well with so little training. But do think about the statistics of sorting a bit more. I think it's not as easily compressible as you think it is, except by an actual sorting algorithm. Again, you can compress it a bit with monotonicity and so on, but nowhere near the amount you'd need to sort a long list without errors, using so few parameters. I compute the number of sorted and unsorted lists in the footnotes.

One of the things that makes sorting tricky for an LLM is you always need to look at every item in the input list. Even if the previous output token was '99', you can't be sure you're now at the end of the list; you still need to count how many '99's were output already and how many are needed.

(The dataset itself, of course, does not contain the notion of sorting, a description of sorting, a test for sortedness, or any algorithm for sorting. It only contains a large but finite number of examples of sorted and unsorted lists. It's up to the LLM, and its training process, to discover the mechanism that generated these results.)

> that's no guarantee that the LLM will be able to compress it efficiently

Your LLM here is 600MB which is a grossly inefficient compression of the sort space.

If LLMs "learned algorithms", the best compression would be on the order of bytes.

The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!

What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers.

NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers.

I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in wrote memorization... or, what?

The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure.

I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB...

To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured.

Whereas representational models are not confined to the distribution of historical cases, ie., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either.

You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

I'm not exactly sure why you think this is a reply to the relevant claims.

> The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!

This isn't a fair comparison. The python code to sort a list is leveraging an enormous amount of information that is stored outside the python code, whereas the GPT version basically has to do it "from scratch", and in a very convoluted computing model.

A better comparison would be "how many bits does it take to encode a configuration of NAND gates that describes a computer that can sort 127-byte lists of number 1..100?"

I'm sure it's not as much as 600 megabytes, but it'll be a lot more than the python code.

> If LLMs "learned algorithms", the best compression would be on the order of bytes.

Yes. Except:

(1) the model size is fixed during training, it would be impossible to obtain a bytes-sized result regardless of what it learns to represent. One might even open the thing up and find bubblesort* inside followed by 599 MB of junk DNA; that size is dictated by how it was initialized.

(2) I'm not claiming this model is a minimal size; I started with the biggest model I could train on my wimpy GPU and succeeded on my first and only try, which I think is a fairer representation of how GPT-4 was built than if I'd started by proving the minimum size of transformer that could represent the task** and then (surprise!) obtained it.

(3) Compared with the size of a map of all 10^80 unique input lists to all 10^36 correctly-corresponding sorted outputs, 600 MB is a remarkable compression ratio, even if it's not reducing it all the way down to exec("sort(input)").

(4) Nowhere do I make any claim that transformers are minimal or even space-efficient representation of an algorithm (or a world-model); in fact, they seem quite terrible in this respect, especially compared to arbitrary code. And doubtless there are a bunch of weights that got trained to near-zero and could be trimmed to make the matrices more sparse, or quantized, which is the kind of thing people do to compress an LLM itself but I didn't bother. What transformers do seem to do very well at, despite the overhead, is the differentiability that allows them to be trained in the first place, and also the flexibility to handle different kinds of problems. I could have trained the same blank-slate starting model to one that shuffles or reverses each list, or perhaps to do one or the other depending on whether the first number is odd or even, or any number of other tasks.

> You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

It's almost definitely the case that every list it's tested on, and sorts 100% correctly, is a list it has never seen in training (unless it's a very short list, but I control for that). My training dataset is only about 100 MB; given the number of random lists, it's vanishingly unlikely that it's seen almost any of them, let alone the 100% of them that it is able to sort correctly. (The tests, of course, were not drawing from the validation set either; I test the model by generating new lists on the fly, because that's easy to do).

> statistically approximate the empirical distribution of the training dataset structure

Can you provide more details about what you mean by this distributional structure that can be compressed without a generally-correct sorting algorithm? How would you define a similarity measure between distinct random lists that allows for this kind of interpolation?

* Well, probably RASP-sort, not bubblesort. Also, it would need to include definitions of things like the comparison operator between all tokens, because it doesn't have a numeric datatype built in, or even the idea of numbers as an ordered set; it has to learn all that.

** (the Weiss paper does this, and lo and behold, transformers can indeed sort).

I'm curious, why are you using "n-gram" as if you're referring to a model? You say e.g. that "LLM is learning an n-gram". N-grams are features, not models. You can train an n-gram model, or you can train a language model using n-grams as features, and so on, but you can't "learn an n-gram".

Where did you find this terminology?

EDIT:

>> and Hyp2 is that the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list.

Btw, you have not shown anything like that. You trained and tested on lists of two-digit positive integers expressible in 128 characters. That's not "any input list". As a for instance, what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?

Your model also doesn't correctly generalise, not even to its own training set that you tested it on. There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).

It's not clear to me how you account for those obvious limitations of your model (it's a toy model after all) when you claim that it "learned to implement a sorting algorithm" etc. It would be great if you could clarify that.

> what do you think would happen if you gave your model an alphanumeric list to sort? Did you try that?

The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?

> You say e.g. that "LLM is learning an n-gram"[...] you can't "learn an n-gram".

Where do I say that? I don't think I make any reference to "learning an n-gram", which is a relief because I don't know what it would mean to "learn an n-gram".

> There's plenty of error in the figure where you show its accuracy (not clear if that's training or test accuracy).

Test accuracy between training iterations (not part of the training process itself, which uses its own separate validation set which is split from the training set). And yes, I agree, it is not error-free, and I wouldn't expect it to be, especially after so little training. What the figure shows is the percentage of sorts that were error-free, and how rapidly that decreases. I've since repeated the test with finer resolution, and the fraction of imperfect sorts continues to decrease about as you expect, which is enough to satisfy my curiosity, although I'm a little curious to see if there is some point where it falls completely to zero.

>> Where do I say that?

In your comment above:

(...) is expressed a little bit more clearly as _the LLM is learning an n-gram_ that produces correct sorts (...)

(My underlining)

You also use it in a similarly unusual way throughout your linked substack post, for example, you write:

the way GPT works is, in a certain sense, functionally equivalent to an n-gram, but that doesn’t mean GPT is an n-gram.

Where does this use of "n-gram" come from? I mean, did you see it somewhere? I'm curious, where?

>> The tokenizer would throw an exception, because it doesn't have any tokens to represent alphabetical characters. But you tell me - if I had tokenized alphabetical characters and defined an ordering, would you expect the results to be any different?

I'm sorry, I don't understand. "Defined an ordering", where?

You can change your tokenizer but that will not change the trained model, obviously. So if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly. But isn't that what you claim? That:

"the LLM's training molds it into representing an actual sorting algorithm that would correctly generalize to any input list"

"Any input list"? How so?

> In your comment above

Oh, I see, good catch. I think that comment was a result of a botched edit; I do that sometimes. Too late to change it now. Sorry for the confusion!

> Where does this use of "n-gram" come from? I mean, did you see it somewhere?

It's shorthand for n-gram Markov model. The same way it is presented in, for example, A Mathematical Theory of Communication.

> "Defined an ordering", where?

In order for a set to be sortable, you need to define an ordering over the elements. So for example, defining that the letter 'A' is greater than the number '99'. It's easy to take for granted that 1 < 2, but the neural network doesn't know that a priori, because the tokens are just index values. It doesn't have any way to know that token number 5 represents the character '5'.

> if you take your model that's trained on two-digit lists of integers and you run it on lists of any other type of elements it will not be able to sort them correctly.

To reiterate, the token dictionary basically just contains the characters "0123456789,():[]_\n". If you try to ask it to sort '(Tuesday, Monday)', it's just going to throw an exception because 'T' isn't a recognized token; it doesn't have a corresponding index. It's not even a question of whether it can sort them correctly or incorrectly.

> "Any input list"? How so?

I think the meaning is pretty clear. No algorithm can sort a list of elements that aren't members of a totally ordered set, so I wasn't attempting to imply that any input list meant that a neural network could somehow supersede this limitation.