| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mjburgess 1099 days ago

There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Redness is not in the structure of those sentences. And there will always be an infinity of sentences which are True but cannot be infered by an LLM -- but can be so, trivially, by a person acquainted with redness.

In any case,

I'd need more time than I have at the moment to seriously state Hyp1 for your case -- but atm, I can say that because the data itself has the property, Hyp1 becomes much harder to state and the argument much subtler.

Since what is a "statistical distribution" of "ordinals" anyway? And how much memory is required to represent it? My sense is this distribution has highly redundant features which will be trivially compressible without learning any "sorting algorithm".

At a quick glance of your article it feels like you havent formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

2 comments

TeMPOraL 1099 days ago

> There are an infinite number of sentences which describe what "being red" is, most of them have never been written.

Which is exactly how the set of sentences actually written encodes in it the idea of "Redness". It's the "actually written" part that carries information about the real world.

> And there will always be an infinity of sentences which are True but cannot be infered by an LLM -- but can be so, trivially, by a person acquainted with redness.

That's cheating, because "a person acquainted with redness" presumably learned it by sight, which LLMs can't do just yet (at least the widely accessible ones can't). Would you also say that a person born blind also cannot infer those True sentences about redness? Because if they can, that means the concept of redness is capable of being taught through language, and so there's no reason LLMs couldn't pick up on it too.

link

jbay808 1099 days ago

> Redness is not in the structure of those sentences.

Sure; it's in the spectrum of reflected light. (Or perhaps, the retina's trichromal responsivity). But that physical concept can be meaningfully described by sentences. It doesn't require an infinite number of them to create a coherent world-model, which can do things like predicting that a blue object will become red if it moves away from you at a high enough speed. Which is something a human might be surprised by even after many years of visual experience with red objects -- unless they've read sentences about the Doppler effect in a physics textbook.

If you can manage to trick GPT-4 into revealing that it doesn't have a world-model of the concept of 'red', please show us!

> At a quick glance of your article it feels like you havent formulated Hyp1 correctly -- P(CorrectSort | f(HistoricalCases)) is perhaps arbitrarily high if some statistical f() can be chosen well.

Keep in mind, the LLM's structure was not hand-crafted to do well on this mathematical task. It was built to be good at language modelling, and initialized with essentially a uniform prior over all token sequences. Even if a dataset is efficiently compressible, that's no guarantee that the LLM will be able to compress it efficiently. In fact, many people would probably be surprised to learn that it can do this problem at all, let alone so well with so little training. But do think about the statistics of sorting a bit more. I think it's not as easily compressible as you think it is, except by an actual sorting algorithm. Again, you can compress it a bit with monotonicity and so on, but nowhere near the amount you'd need to sort a long list without errors, using so few parameters. I compute the number of sorted and unsorted lists in the footnotes.

One of the things that makes sorting tricky for an LLM is you always need to look at every item in the input list. Even if the previous output token was '99', you can't be sure you're now at the end of the list; you still need to count how many '99's were output already and how many are needed.

(The dataset itself, of course, does not contain the notion of sorting, a description of sorting, a test for sortedness, or any algorithm for sorting. It only contains a large but finite number of examples of sorted and unsorted lists. It's up to the LLM, and its training process, to discover the mechanism that generated these results.)

link

mjburgess 1099 days ago

> that's no guarantee that the LLM will be able to compress it efficiently

Your LLM here is 600MB which is a grossly inefficient compression of the sort space.

If LLMs "learned algorithms", the best compression would be on the order of bytes.

The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!

What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers.

NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers.

I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in wrote memorization... or, what?

The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure.

I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB...

To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured.

Whereas representational models are not confined to the distribution of historical cases, ie., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either.

You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

I'm not exactly sure why you think this is a reply to the relevant claims.

link

jstanley 1099 days ago

> The python to generate this list is c. 1kb -- and you're using an obscene 600MB to do it!

This isn't a fair comparison. The python code to sort a list is leveraging an enormous amount of information that is stored outside the python code, whereas the GPT version basically has to do it "from scratch", and in a very convoluted computing model.

A better comparison would be "how many bits does it take to encode a configuration of NAND gates that describes a computer that can sort 127-byte lists of number 1..100?"

I'm sure it's not as much as 600 megabytes, but it'll be a lot more than the python code.

link

YeGoblynQueenne 1098 days ago

The model the OP trained also runs on a computer. If I'm not mistaken, it also runs on Python itself.

Which means you don't need to count all the bytes in the infrastructure all the way down to the electric grid, maybe. You can compare a sorting algorithm to a sorting model, as stand-alone programs, on their relative size, and that will give you a good idea of how much work each is doing.

link

jbay808 1099 days ago

> If LLMs "learned algorithms", the best compression would be on the order of bytes.

Yes. Except:

(1) the model size is fixed during training, it would be impossible to obtain a bytes-sized result regardless of what it learns to represent. One might even open the thing up and find bubblesort* inside followed by 599 MB of junk DNA; that size is dictated by how it was initialized.

(2) I'm not claiming this model is a minimal size; I started with the biggest model I could train on my wimpy GPU and succeeded on my first and only try, which I think is a fairer representation of how GPT-4 was built than if I'd started by proving the minimum size of transformer that could represent the task** and then (surprise!) obtained it.

(3) Compared with the size of a map of all 10^80 unique input lists to all 10^36 correctly-corresponding sorted outputs, 600 MB is a remarkable compression ratio, even if it's not reducing it all the way down to exec("sort(input)").

(4) Nowhere do I make any claim that transformers are minimal or even space-efficient representation of an algorithm (or a world-model); in fact, they seem quite terrible in this respect, especially compared to arbitrary code. And doubtless there are a bunch of weights that got trained to near-zero and could be trimmed to make the matrices more sparse, or quantized, which is the kind of thing people do to compress an LLM itself but I didn't bother. What transformers do seem to do very well at, despite the overhead, is the differentiability that allows them to be trained in the first place, and also the flexibility to handle different kinds of problems. I could have trained the same blank-slate starting model to one that shuffles or reverses each list, or perhaps to do one or the other depending on whether the first number is odd or even, or any number of other tasks.

> You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

It's almost definitely the case that every list it's tested on, and sorts 100% correctly, is a list it has never seen in training (unless it's a very short list, but I control for that). My training dataset is only about 100 MB; given the number of random lists, it's vanishingly unlikely that it's seen almost any of them, let alone the 100% of them that it is able to sort correctly. (The tests, of course, were not drawing from the validation set either; I test the model by generating new lists on the fly, because that's easy to do).

> statistically approximate the empirical distribution of the training dataset structure

Can you provide more details about what you mean by this distributional structure that can be compressed without a generally-correct sorting algorithm? How would you define a similarity measure between distinct random lists that allows for this kind of interpolation?

* Well, probably RASP-sort, not bubblesort. Also, it would need to include definitions of things like the comparison operator between all tokens, because it doesn't have a numeric datatype built in, or even the idea of numbers as an ordered set; it has to learn all that.

** (the Weiss paper does this, and lo and behold, transformers can indeed sort).

link

mjburgess 1098 days ago

> Can you provide more details about what you mean by this distributional structure

The distribution of sorted digits is:

(0 1 2 3 4 5 6 7 8 9) before

(1 before 0 1 2 3 4 5 6 7 8 9) before

(2 before 0 1 2 3 4 5 6 7 8 9) before

(3 before 0 1 2 3 4 5 6 7 8 9) ...

...

When you compute the search space you're treating each number as a unique token (ie., that all ordinals are unique) -- but its not sorting unique ordinals, it's sorting digits in a sequential model ie., it learns P(Next|Prev)

The (sequential) distribution of digits amongst sorted numbers is tiny

link

jbay808 1098 days ago

> The (sequential) distribution of digits amongst sorted numbers is tiny

This is why 10^80 random lists gets reduced to only 10^36 sorted lists. However, 10^36 is still very large with respect to the size of the model.

link

mjburgess 1098 days ago

You're treating each list as unique, all the lists have a distribution of digits in common... I'm at a loss to even understand what you're saying here really -- this is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.

It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isnt the claim?

The claim is rather, they sample from a statistical distribution of tokens.

Take each position in the input vector, 1...127. It needs to "learn":

P(x0 position | y, x1...x127 positions), P(1|y, 2...127), P(2|y, 3...127), etc.

Which is a family of 127 conditional distributions that seem trivial to learn.

I really don't know why you think the size of a combinatorial space is relevant here?

All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i-1)...x_127) }

link