| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by YeGoblynQueenne 1095 days ago

Your model is not sorting correctly and it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data (i.e. data not available to the experimenter). And we do this because under PAC-Learning assumptions a learner is said to learn a concept when it can correctly label instances of the concept with some probability of some error. In real-world situations we do not know the true concept, so we test on held-out data to approximate the probability of error.

Bottom line, if you train a model to do a thing and you don't test it carefully to figure out its error, you might claim it's learned something, but in truth, you have no idea what it's learned.

(To clarify: you tested on the train data assuming there's a low probability of overlap. Don't do that if you're trying to understand what your models can do).

1 comments

jbay808 1095 days ago

> it sure has not learned any "algorithm". At best it has learned to approximate a sorting algorithm. That's what statistical machine learning models do, they are function approximators; not program learners.

Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

> Also, Machine Learning 101: you test your models on a test set that is disjoint to the training set. To clarify, we do this not because it's in the book and that's the rules, but because, by testing the model on held-out data, we can predict the error the model will have on unseen data

The probability of a test list existing in the training set is less than 10^-70.

link

YeGoblynQueenne 1094 days ago

>> Transformers are RASP programs, which includes sorting programs. See the Weiss paper (https://arxiv.org/pdf/2106.06981.pdf).

That's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

What is certainly established is that a function, and an algorithm, are different objects. To clarify, a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics) but that's not a requirement.

So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.

Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means that your trained ƒ̂ is a function that maps some of the elements of X to the same elements of Y as ƒ, but not all. It's an approximation. So you get some amount of error, as in your experiment.

So what you've done in your experiment is that you trained a model to approximate a mapping between the set of lists, to itself (where the input list is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.

I appreciate that, learning an algorithm, is what you wanted to achieve, but in science we don't choose the answer that pleases us, we choose the answer that makes the most sense- and a good heuristic for that is that the answer that makes more sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm, rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than the well-understood abstractions of elementary computer science, so not at all the simplest explanation. That is not a good idea. You will not understand what is going on, if you rely on that kind of explanation. I assume you are trying to understand?

Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, certainly with an LSTM. Ceteris paribus, you'll get the same results.

>> The probability of a test list existing in the training set is less than 10^-70.

But the same probability if you held the test set out would be 0, so why not do that? It's not hard to do.

Is there a good reason not to do that?

Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?

Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!

link

jbay808 1094 days ago

> that's one preprint on arxiv, that makes a wild claim about a new concept that they acronymise as "RASP". It's not any kind of established terminology, nor is it anything but a claim.

Would you change your mind for a different link, like this one? http://proceedings.mlr.press/v139/weiss21a.html

I think you would enjoy learning about RASP, rather than taking such a hardline skeptical position.

> a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that calculates the result of a function and is guaranteed to terminate

I'm aware. Transformers (and RASP programs) are guaranteed to terminate; that's one of their nice properties.

> Is there a good reason not to do that?

Balanced against the value of my unpaid time, a probability of 10^-70 is low enough for the purposes of a quick and fun test.

Speaking of which, I'm going to enjoy my weekend now. I hope you enjoy yours!

link

Al0neStar 1094 days ago

That's the same paper.

link

YeGoblynQueenne 1094 days ago

EDIT: I realise I was mistaken about the OP. He is not an undergarduate student, as I initially thought. His substack profile says he is a professional engineer and consultant. So his complete cluelessness about computer science fundamentals is not the result of inexperience, and his article is nothing more than an attempt to jump on the current bandwagon of LLM hype rather than an attempt to make sense of things. I thought I was helping a CS grad learn something! What an idiot I am! Fuck. φτου γαμώ την Παναγία μου.

[Earlier text of my comment left in the interest of something or other]

It is, but it's published in the proceedings of the ICML, which means it's been peer reviewed.

The OP has checked out (I guess all this computer scienc-y stuff is boring on a weekend), but even a peer-reviewed article is not enough to cause us to let go of good, old-fashioned computer science. The article basically invents its own language and then proceeds to map transformers to it, to claim that transformers can learn various kinds of programs. It's not convincing.

In any case, learning to sort lists by neural nets is not something new, or unique to transformers, and there's pretty clear understanding of how it works. I explain why it doesn't constitute learning an algorithm in my comment above. The RASP paper doesn't change that. I mean, Recurrent Neural Nets have a known equivalence to FSMs but even they cannot learn algorithms but only approximate them. The OP wrote his article in an obvious effort to understand why GPT is "not an n-gram" even if it behaves like an n-gram model (well, it's a language model, it doesn't matter what it's trained on) so I'm guessing he can appreciate the need for clarity in explaining empirical results and he probably will want to think further on what, exactly, his experiment has shown. I hope my little comment above will help him do that.

link