by ggrrhh_ta 1596 days ago
I pointed it out above; even though it is text, the ASCII representation is just a different base for the numbers, base 2^8: '325' is '3' * (2^8)^2 + '2' * 2^8 + '5' = 51 * 2^16 + 50 * 2^8 + 53. The network should approximate those polynomial functions very well.
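
For what it's worth, the base-2^8 reading of an ASCII string can be checked in a few lines of Python. This is only a sketch: the helper name `ascii_base256` is made up for illustration, and the positional weights are successive powers of 256 (2^16, 2^8 and 2^0 for a three-character string).

```python
def ascii_base256(text: str) -> int:
    """Interpret the ASCII bytes of `text` as digits of a base-256 number."""
    value = 0
    for byte in text.encode("ascii"):
        # Shift the running value one base-256 digit left, then add the new digit.
        value = value * 256 + byte
    return value


n = ascii_base256("325")

# '3' is 51, '2' is 50 and '5' is 53 in ASCII.
as_polynomial = ord("3") * 2**16 + ord("2") * 2**8 + ord("5")

# The standard library gives the same big-endian interpretation.
as_builtin = int.from_bytes(b"325", "big")
```

All three values agree, which is the sense in which the text is "just a different base".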

Hmm. I’m not sure what you mean. Temperature is randomness; a low temperature gives the most probable, least random result. It’s what chess engines do during tournaments, for example.
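
To make the temperature point concrete, here is a minimal sketch of temperature-scaled sampling over a list of logits. The function names and the toy logits are assumptions for illustration, not any particular model's API, and treating temperature 0 as a plain argmax is a common convention rather than something stated in this thread.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Pick an index; temperature 0 is treated as greedy argmax (assumed convention)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
probs_cold = softmax_with_temperature(toy_logits, 0.1)
probs_warm = softmax_with_temperature(toy_logits, 1.0)
greedy_choice = sample_index(toy_logits, 0, random.Random(0))
```

At temperature 0.1 almost all of the probability mass sits on the highest-scoring candidate, while at 1.0 the other candidates keep a real chance of being picked.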

The other parts seem unlikely. It has no knowledge of bases, except insofar as they appear in the training set. I saw this in our GPT chess work — even with strange tokenization, it learned chess notation well.

Sorry, I thought it was clear. A neural network, when untrained, is just random noise that multiplies inputs by random weights over and over (plus normalization) until it reaches the output. When you train it on inputs whose outputs are the result of applying some polynomial to those inputs, the weights can be set so that the output very closely approximates that polynomial. It never needs to know the base, and less randomness will help because the computations within the neural network match very well with the function you want to approximate. Still, it is not that simple: outputting the correct ASCII representation is a challenge, for example when a carry is involved (100009999999999 + 1). However, the emergence of good arithmetic from a neural network should not in itself be shocking.
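
A small sketch of why the carry case is awkward at the character level: schoolbook addition on ASCII digit strings, plus a count of how many output characters differ from the input. The function names are hypothetical, and this is ordinary Python, not a claim about what a network computes internally.

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Add two non-negative decimal strings digit by digit, right to left."""
    digits = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    while i >= 0 or j >= 0 or carry:
        da = ord(a[i]) - ord("0") if i >= 0 else 0
        db = ord(b[j]) - ord("0") if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        digits.append(chr(digit + ord("0")))
        i -= 1
        j -= 1
    return "".join(reversed(digits))


def changed_positions(before: str, after: str) -> int:
    """Count character positions that differ between two equal-length strings."""
    return sum(1 for x, y in zip(before, after) if x != y)


operand = "100009999999999"
total = add_decimal_strings(operand, "1")
rippled = changed_positions(operand, total)  # characters rewritten by the carry chain
```

Adding 1 changes the numeric value by the smallest possible amount, yet the carry ripples through every trailing 9, so most of the output string has to be rewritten.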
You are clear but mistaken.

I give you points for creative thinking, but it’s important not to make inferences that “feel correct.” No matter what your gut is telling you, I would happily bet $10k that the emergence of arithmetic has nothing to do with the things you mention.

If an alternative training scheme were devised that didn’t rely on any of that, it would still result in a model that behaved more or less the same as what we see here. The properties of the training process influence the result, but they don’t cause the result — that would be like saying your vocal cords cause you to be an excellent orator. Vocal cords don’t form the ideas; the training process doesn’t form the arithmetic.

What we’re seeing is a consequence of a large training dataset. The more tasks a model can perform, the better it is at any individual task.

I know I can be mistaken (I would never take any amount either way; finding out the true origin of the network's arithmetic capabilities would be a prize that outweighs any sum of money, even if I am enormously mistaken), but I want to raise the point so that it is in the back of our minds. If it were a "simple" backpropagation network, it would not be surprising that it is just solving arithmetic by "finding out the formula" (fitting) to sum from base ASCII to base ASCII (as long as the output is not longer than the ones in the training set). The dataset certainly has an influence, but I would argue that you can learn very good arithmetic with very small datasets. Also, if the training process used different operations, I would argue that, as long as it fits polynomials well, it should be able to solve arithmetic in ASCII within bounds (it would not generalize well to numbers longer than those it was trained on).
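
As a toy version of the "fitting the formula" argument, here is a sketch in which a two-weight linear model learns addition from four examples by plain gradient descent. Everything here is an assumption for illustration: the inputs are small numeric values rather than ASCII strings, the model is a single linear unit rather than GPT, and the learning rate and epoch count are arbitrary.

```python
def fit_addition(samples, lr=0.1, epochs=2000):
    """Fit target ~ w1 * a + w2 * b with per-sample gradient descent on squared error."""
    w1 = w2 = 0.0
    for _ in range(epochs):
        for a, b, target in samples:
            err = (w1 * a + w2 * b) - target
            # Gradient of 0.5 * err**2 with respect to each weight.
            w1 -= lr * err * a
            w2 -= lr * err * b
    return w1, w2


# Four hand-picked (a, b, a + b) examples; a deliberately tiny training set.
training_set = [
    (0.1, 0.2, 0.3),
    (0.5, 0.1, 0.6),
    (0.3, 0.9, 1.2),
    (0.7, 0.4, 1.1),
]

w1, w2 = fit_addition(training_set)
unseen_prediction = w1 * 0.25 + w2 * 0.6  # a pair not in the training set
```

Both weights end up close to 1, so the model adds unseen pairs as well. That only shows that addition on numeric inputs is easy to fit; it says nothing about whether the same thing happens on tokenized text.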
Does GPT know about ASCII? My understanding was that these models use a dictionary of (initially) random vectors as input and learn their own text representation.
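
A rough sketch of what "a dictionary of (initially) random vectors" means. The per-character vocabulary, the vector size, and the initialization scale are all assumptions chosen for illustration; GPT-style models typically use subword tokens rather than single characters.

```python
import random


def build_embedding_table(vocab, dim, seed=0):
    """Map each token to a randomly initialized vector (values are arbitrary)."""
    rng = random.Random(seed)
    return {token: [rng.gauss(0.0, 0.02) for _ in range(dim)] for token in vocab}


def embed(text, table):
    """Look up the vector for each character; the ASCII code is never used as a number."""
    return [table[ch] for ch in text]


toy_vocab = list("0123456789+=")  # hypothetical character-level vocabulary
table = build_embedding_table(toy_vocab, dim=4)
vectors = embed("325", table)
```

In this setup the model's input for '3' is whatever vector the table holds, not the byte value 51, so any relationship between neighbouring digits has to be learned during training.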
In that case, I would say that GPT's performance in arithmetic is something that we see because we are looking for it or want to find it, but that is not there. It is an illusion. If we have no theory of why an arithmetic capability would emerge from GPT, then there is no scientific discovery; at most, there is a field survey, a taxonomist's work, but no understanding is generated.