| > that's no guarantee that the LLM will be able to compress it efficiently Your LLM here is 600MB which is a grossly inefficient compression of the sort space. If LLMs "learned algorithms", the best compression would be on the order of bytes. The Python to generate this list is c. 1 KB -- and you're using an obscene 600MB to do it! What do you think all those MBs are doing? They're the extraordinary cost of the "statistical shortcut" of modelling the empirical distribution of sorted numbers. NNs exploit distributional structure in the training data to compress it --- in this case there's huge amounts of distributional structure in numbers. I think you've misunderstood the "statistical parrot" claim to be somehow that NNs are engaged in rote memorization... or, what? The claim is simply that all they do is statistically approximate the empirical distribution of the training dataset structure --- and if you force interpolation, then they provide arbitrarily precise compressions of that structure. I'm not sure what a NN which can sort numbers shows, other than the distributional structure of a sort-numbers dataset is such that a NN can compress it into 600MB... To be clear, the "statistical parrot" claim is that the statistical distribution of the empirical dataset D = (X, y) is being approximated by the weights, W = Compress(D) -- and that this distribution fails to be a representational model of y -- because no entailments of X (other than those in D) are captured. Whereas representational models are not confined to the distribution of historical cases, i.e., I can imagine variations on X leading to any given y; and variations on y leading to any given X -- without ever having experienced either. You're showing the system vast amounts of numbers being sorted, so it learns the distribution of that data, so it can replay those sorts. I'm not exactly sure why you think this is a reply to the relevant claims. |
This isn't a fair comparison. The Python code to sort a list leverages an enormous amount of information stored outside the Python code itself (the interpreter, its built-in sort, and the machine underneath), whereas the GPT version basically has to do it "from scratch", and in a very convoluted computing model.
A better comparison would be "how many bits does it take to encode a configuration of NAND gates that describes a computer that can sort 127-byte lists of numbers 1..100?"
I'm sure it's not as much as 600 megabytes, but it'll be a lot more than the Python code.
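
To put a rough number on that, here is a back-of-the-envelope sketch. It is not a full computer, just a fixed sorting circuit, so treat it as a simplified stand-in for the question above. It builds a Batcher odd-even merge sorting network for 128 inputs (the 127 values plus one padding sentinel), then estimates how many bits a plain NAND netlist would need. Everything here is my own assumption rather than anything measured: the 7-bit word width (enough for 1..100), the figure of about 100 NAND gates per compare-and-swap, and the encoding of each gate as two wire indices. The helper names are made up for illustration.

```python
import math
import random


def oddeven_merge(lo, hi, r):
    """Yield the comparators that merge two sorted halves of lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort_range(lo, hi):
    """Yield the comparators that sort positions lo..hi (inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort_range(lo, mid)
        yield from oddeven_merge_sort_range(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def batcher_network(n):
    """Comparator list for n inputs; n must be a power of two."""
    return list(oddeven_merge_sort_range(0, n - 1))


def apply_network(network, values):
    """Run the fixed compare-and-swap sequence; no data-dependent control flow."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def netlist_bits(n_comparators, nand_per_cas, n_input_bits):
    """Bits to store a netlist where each NAND gate names its two input wires."""
    gates = n_comparators * nand_per_cas
    wires = gates + n_input_bits              # every gate output plus every input bit
    index_bits = math.ceil(math.log2(wires))  # bits needed to name one wire
    return gates * 2 * index_bits


if __name__ == "__main__":
    WORD_BITS = 7        # assumption: values 1..100 fit in 7 bits
    NAND_PER_CAS = 100   # assumption: rough cost of a 7-bit comparator plus two muxes

    net = batcher_network(128)
    data = [random.randint(1, 100) for _ in range(127)] + [127]  # 127 pads to 128
    assert apply_network(net, data) == sorted(data)

    bits = netlist_bits(len(net), NAND_PER_CAS, 128 * WORD_BITS)
    print(f"{len(net)} comparators, roughly {bits / 8 / 1024:.0f} KiB of netlist")
```

Under those assumptions the estimate comes out on the order of hundreds of kilobytes: far more than a 1 KB script, far less than 600MB. A real general-purpose computer built from NAND gates would need more than this, and a cleverer encoding would need less, so it is only meant to show the order of magnitude.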