by jbay808 1095 days ago
> If LLMs "learned algorithms", the best compression would be on the order of bytes.

Yes. Except:

(1) the model size is fixed during training, so it would be impossible to obtain a bytes-sized result regardless of what it learns to represent. One might even open the thing up and find bubblesort* inside followed by 599 MB of junk DNA; that size is dictated by how it was initialized.

(2) I'm not claiming this model is a minimal size; I started with the biggest model I could train on my wimpy GPU and succeeded on my first and only try, which I think is a fairer representation of how GPT-4 was built than if I'd started by proving the minimum size of transformer that could represent the task** and then (surprise!) obtained it.

(3) Compared with the size of a map of all 10^80 unique input lists to all 10^36 correctly-corresponding sorted outputs, 600 MB is a remarkable compression ratio, even if it's not reducing it all the way down to exec("sort(input)").

(4) Nowhere do I make any claim that transformers are a minimal or even space-efficient representation of an algorithm (or a world-model); in fact, they seem quite terrible in this respect, especially compared to arbitrary code. And doubtless there are a bunch of weights that got trained to near-zero and could be trimmed to make the matrices more sparse, or quantized, which is the kind of thing people do to compress an LLM itself, but I didn't bother. What transformers do seem to offer, despite the overhead, is the differentiability that allows them to be trained in the first place, and the flexibility to handle different kinds of problems. I could have trained the same blank-slate starting model into one that shuffles or reverses each list, or perhaps does one or the other depending on whether the first number is odd or even, or any number of other tasks.
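
To make the counting behind point (3) concrete, here is a rough sketch. The vocabulary size and list length are placeholder parameters (this comment does not state the exact setup), and the function names are made up for illustration. The only fact relied on is standard combinatorics: a sorted list is a multiset, so the number of distinct sorted outputs of a given length is a "stars and bars" binomial coefficient.

```python
from math import comb, log10

def count_input_lists(vocab_size: int, length: int) -> int:
    """Number of ordered lists of `length` tokens drawn from `vocab_size` symbols."""
    return vocab_size ** length

def count_sorted_outputs(vocab_size: int, length: int) -> int:
    """Number of distinct sorted lists, i.e. multisets of size `length`
    over `vocab_size` symbols: C(vocab_size + length - 1, length)."""
    return comb(vocab_size + length - 1, length)

def orders_of_magnitude(n: int) -> float:
    """log10 of a positive count, for comparing against a model size in bytes."""
    return log10(n)

if __name__ == "__main__":
    # Illustrative parameters only; not the setup used for the model above.
    v, n = 10, 3
    print(count_input_lists(v, n), count_sorted_outputs(v, n))
```

Even the smaller of the two counts grows far faster than any fixed weight file as the list length increases, which is all that point (3) needs.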

> You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts.

It's almost certainly the case that every list it's tested on, and sorts 100% correctly, is a list it has never seen in training (unless it's a very short list, but I control for that). My training dataset is only about 100 MB; given the number of possible random lists, it's vanishingly unlikely that it has seen more than a tiny fraction of them, let alone the 100% of them that it is able to sort correctly. (The tests, of course, were not drawing from the validation set either; I test the model by generating new lists on the fly, because that's easy to do.)
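
A test of that kind might look roughly like the sketch below. This is not the actual harness: `model_sort` is a hypothetical stand-in for whatever inference call wraps the trained model, and the vocabulary size and maximum length are assumed values.

```python
import random
from typing import Callable, List, Optional

def evaluate_on_fresh_lists(
    model_sort: Callable[[List[int]], List[int]],  # hypothetical model wrapper
    n_trials: int = 1000,
    max_len: int = 40,        # assumed maximum list length
    vocab_size: int = 100,    # assumed number of distinct tokens
    seed: Optional[int] = None,
) -> float:
    """Generate random lists on the fly and return the fraction sorted exactly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        length = rng.randint(1, max_len)
        xs = [rng.randrange(vocab_size) for _ in range(length)]
        # Compare against the reference answer from Python's built-in sort.
        if model_sort(xs) == sorted(xs):
            correct += 1
    return correct / n_trials
```

Because each list is drawn fresh, an exact-match score of 1.0 cannot come from replaying memorized training examples unless the lists are short enough to collide with the training set.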

> statistically approximate the empirical distribution of the training dataset structure

Can you provide more details about what you mean by this distributional structure that can be compressed without a generally-correct sorting algorithm? How would you define a similarity measure between distinct random lists that allows for this kind of interpolation?

* Well, probably RASP-sort, not bubblesort. Also, it would need to include definitions of things like the comparison operator between all tokens, because it doesn't have a numeric datatype built in, or even the idea of numbers as an ordered set; it has to learn all that.

** (the Weiss paper does this, and lo and behold, transformers can indeed sort).


> Can you provide more details about what you mean by this distributional structure

The distribution of sorted digits is:

(0 1 2 3 4 5 6 7 8 9) before

(1 before 0 1 2 3 4 5 6 7 8 9) before

(2 before 0 1 2 3 4 5 6 7 8 9) before

(3 before 0 1 2 3 4 5 6 7 8 9) ...

...

When you compute the search space you're treating each number as a unique token (i.e., that all ordinals are unique), but it's not sorting unique ordinals; it's sorting digits in a sequential model, i.e., it learns P(Next|Prev).

The (sequential) distribution of digits amongst sorted numbers is tiny

> The (sequential) distribution of digits amongst sorted numbers is tiny

This is why 10^80 random lists get reduced to only 10^36 sorted lists. However, 10^36 is still very large with respect to the size of the model.

You're treating each list as unique, but all the lists have a distribution of digits in common. I'm at a loss to even understand what you're saying here, really. This is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.

It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed, but that isn't the claim.

The claim is rather, they sample from a statistical distribution of tokens.

Take each position in the input vector, 1...127. It needs to "learn":

P(x0 position | y, x1...x127 positions), P(1|y, 2...127), P(2|y, 3...127), etc.

Which is a family of 127 conditional distributions that seem trivial to learn.

I really don't know why you think the size of a combinatorial space is relevant here?

All the sorted lists share basically the same tiny family of conditional distributions { P(x_i | x_(i-1)...x_127) }
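
As an illustration only, here is what that family of conditionals looks like if written out exactly for a perfect sorter. This is a sketch under assumptions: the function names are invented, and it says nothing about what a trained network computes internally. Given the input tokens not yet emitted, the distribution over the next output token puts all its mass on the smallest remaining one.

```python
from collections import Counter
from typing import Dict, List

def next_token_distribution(remaining: Counter) -> Dict[int, float]:
    """Exact P(next output token | input tokens not yet emitted) for a
    perfect sorter: degenerate, with probability 1 on the smallest token."""
    smallest = min(tok for tok, count in remaining.items() if count > 0)
    return {smallest: 1.0}

def decode_sorted(xs: List[int]) -> List[int]:
    """Greedy autoregressive decoding from the conditionals above."""
    remaining = Counter(xs)
    out: List[int] = []
    for _ in range(len(xs)):
        dist = next_token_distribution(remaining)
        token = max(dist, key=dist.get)  # pick the most probable next token
        out.append(token)
        remaining[token] -= 1            # that input token is now used up
    return out
```

Decoding greedily from these conditionals is step-for-step a selection sort, so in this particular task the conditional distributions and the sorting procedure are the same object described two ways.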

I agree a neural network can certainly learn the conditional distributions that let it make that choice correctly every time. Once it has done so, then do you not have a sorting algorithm?

So this is what I thought you would say, and it's the origin of the issues here: to say that LLMs are "statistical parrots" is just to say they learn conditional distributions of text tokens.

So you aren't replying to the "only stats" claim: that is the claim!

The issue is that language use isn't a matter of distributions of text tokens: when I say, "the sky is clear today!" it is caused by there being a blue sky. When I then say, "therefore I'd like to go out!" it is caused by my preferences, etc.

So if we had a generative causal model of language it would be something like this: Agent + Environment + Representations ---SymbolicTranslation---> Language.

All LLMs do is model the data being generated by this process; they don't model the process itself (i.e., agents, environments, representations, etc.).

They say, "it is a nice day" only because those tokens match some statistical distribution over historical texts, not because they have judged the day nice.

To model language is not to provide an indistinguishable language-like distribution of text tokens, but rather, for an agent to use language to express ideas caused by their internal states + the world.

In the case of sorting numbers, the tokens themselves have the property (i.e., mathematical properties such as ranking are had by ranked tokens). So learning the distribution is learning the property of interest.

This is why papers demonstrating that NNs "have representations" etc. by appealing to formal properties the data itself has are not even relevant to the discussion. Yet all this "world model, algorithm, blah blah" said of NNs is only ever shown using data whose "unsupervised model" constitutes the property of interest.

Statistical models of the distributions of tokens are not models of the data generating process which produces those tokens (unless that process is just the distribution of those tokens). This is obvious from the outset.