GPT-3 is much better at math than it should be (ravisparikh.substack.com)
50 points by Liron 1202 days ago
9 comments

I'd be curious to see more examination of the questions and answers at the token level, rather than by counting digits or calculating percentage error. For example, according to https://platform.openai.com/tokenizer, 727941 + 761830 is split as 7,279,41, +, 76,18,30. The answer given was 1589771 (as opposed to 1489771). To me that looks like it correctly added 41 and 30, but had trouble with the mis-matched tokenizations of 7,279 and 76,18. I wonder if that sort of pattern would hold in general?
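A minimal Python sketch of the check described above. The token splits are copied from the comment rather than produced by a real tokenizer, and the helper name `differing_positions` is hypothetical. It only locates the wrong digit and confirms that the aligned trailing chunks were added correctly; it says nothing about why the model erred.

```python
from typing import List

# Operands, the model's answer, and the token splits quoted in the comment.
a, b = 727941, 761830
model_answer = 1589771
a_tokens = ["7", "279", "41"]
b_tokens = ["76", "18", "30"]


def differing_positions(x: int, y: int) -> List[int]:
    """Return digit positions (0 = leftmost) where x and y differ.

    Assumes both numbers have the same number of digits.
    """
    xs, ys = str(x), str(y)
    if len(xs) != len(ys):
        raise ValueError("numbers must have the same number of digits")
    return [i for i, (p, q) in enumerate(zip(xs, ys)) if p != q]


# Sanity check: the quoted chunks concatenate back to the operands.
assert "".join(a_tokens) == str(a) and "".join(b_tokens) == str(b)

true_sum = a + b
wrong_at = differing_positions(true_sum, model_answer)

# The trailing chunks "41" and "30" line up in both operands.
last_chunk_sum = int(a_tokens[-1]) + int(b_tokens[-1])

print("true sum:      ", true_sum)
print("model answer:  ", model_answer)
print("wrong digits at", wrong_at)
print("41 + 30 =      ", last_chunk_sum)
```

Running the same comparison over many question/answer pairs would show whether errors cluster at the boundaries of mismatched chunks.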
The "edit distance" errors would seem to be the tell that yours is a better explanation, perhaps along with problems matching near-edit-distance numbers (tokens) in latent space.

In "A operand B equals C", any of the three numbers can compress to the same space as a 'close' number or numbers in some problem out in web space. So while the author googled the expressions he asked, I wouldn't expect those to be found verbatim when answers are wrong; rather, to your point, the author should web search for tokens, or web search for problems within, say, +/- 3 for each digit of tokens, in all permutations.

>> But this program is representable by a neural net; after all, neural nets are turing complete. [1]

This is indeed evidence of an interesting phenomenon. It seems that many of the hare-brained things that people say lately are conclusions they have drawn starting from the premise that neural nets are somehow magickal and mysterious, and so they can do anything and everything anyone could imagine, and we don't even really need to come up with any other explanation about those wonders, than "it's a neural net!".

So, for example, the author can claim that "there’s some sort of fuzzy arithmetic engine at the heart of GPT-3", without having to explain what, exactly, is a "fuzzy arithmetic engine" (it's just "some sort" of thing, who cares?) and why we need such a device to explain the behaviour of a language model.

Then again, what's the point? People write stuff on the internets. Now we have language models trained on that nonsense. Things can only get worse.

_______________

[1] The link in the article points to a paper on the computational capabilities of Recurrent Neural Nets (RNNs), not "neural nets" in general. The Transformer architecture, used to train GPT-3's model, is not an RNN architecture. In any case, the linked paper, and papers like it, only show that one can simulate any Turing machine by a specially constructed net. To learn a neural net that simulates any Turing machine (i.e. without hand-crafting) one would have to train it on Turing machines; and probably all Turing machines. GPT-3's model, besides not being an RNN, was trained on text, not Turing machines, so there are a few layers of strong assumptions needed before one can claim that it somehow, magickally, turned into a model of a Turing machine.

Anyway, the Turing-complete networks discussed in the linked paper, and similar work, inherit the undecidability of Universal Turing Machines and so it is impossible to predict the value of any activation function at any point in time. Which means that, if a neural net ever really went Turing complete, we wouldn't be able to tell whether its training has converged, or if it ever will. So that's an interesting paper - that the author clearly didn't read. I guess there's too many scary maths for a "layman". Claiming that GPT-3 has "some sort of fuzzy arithmetic engine" doesn't need any maths.

Thanks for taking the time to read the article and comment. Appreciate your feedback. As you point out, my last couple paragraphs were somewhat speculative and handwav-y. Do you have an alternative viewpoint on what allows LLMs to be able to somewhat accurately answer complicated math questions, despite lacking an explicitly programmed math solver? It sounds like you may be better informed than me–would love to hear your thoughts.

> that the author clearly didn't read. I guess there's too many scary maths for a "layman".

No need for the personal attack. I did read the paper and the math in the paper is not particularly complicated.

Well, that's awkward. I didn't realise you were on HN. I'm sorry for the personal tone of my comment. You are right that it was uncalled for.

The paper you linked is clear on the scope of its proofs and in any case it's a very big assumption to say that "neural nets are Turing complete", when there are scant few such proofs, compared with the large number of different architectures (for most of which, no careful investigation of their computational capabilities is ever done anyway).

You could add a clarification to your article.

>> Do you have an alternative viewpoint on what allows LLMs to be able to somewhat accurately answer complicated math questions, despite lacking an explicitly programmed math solver?

Yes, it's because they're language models. In particular, they're very powerful, very smooth (in the statistical sense) language models trained to represent gigantic text corpora. Their ability to produce correct answers once in a while is not a surprise and does not need any other explanation.

Predicting what a language model (big or small) will output is another matter, so one particular instance of generated output might be surprising in the sense that the user won't expect it - not in the sense that the model shouldn't be able to produce it.

In any case, it's clear that the performance of those models depends on the prompts. Change the prompt slightly and you get a different answer, to any question. That suggests retrieval from memory (modulo stochasticity) much more than it suggests computation. And we know that these models are not models of computation, so there's no question what's really going on.

When I say "retrieval from memory" I don't mean that these models memorise whole sequences of tokens verbatim. To make a very big fudge about it, it's as if they've memorised templates that they can then apply to questions to generate the right answers.

I guess that still sounds magickal and mysterious if one hasn't worked with language models before, so all I can say is, if you are really curious, and really want to understand the specifics, you should try to learn more about language models.

I suggest the following as a starting point:

Eugene Charniak, Statistical Language Learning

https://mitpress.mit.edu/9780262531412/statistical-language-...

Dan Jurafsky and James H. Martin, Speech and Language Processing

https://web.stanford.edu/~jurafsky/slp3/

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing

https://nlp.stanford.edu/fsnlp/

Those are rather "wax-on, wax-off", but if you want to learn Karate, that's where to begin. Then you can go on to beat up the Transformers and win the girl.

The Charniak book in particular is small and sweet and easy to read. Start there.

On the one hand, the GPT3 model is not Turing complete because run-time of each invocation is linear in the input size (number of tokens). There is no input that will generate an infinite output, for example.

On the other hand, if you run each invocation on the output of the previous invocation, it seems plausible that you could give it a prompt with a description of a Turing machine and have it simulate indefinitely. In this way of looking at it, GPT3 only encodes a transition table of a Turing machine. We only have to believe that one can code arbitrary transition tables inside GPT3, be it through an initial prompt or through manual adjustment of the internal weights.
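To make the "transition table in a loop" picture concrete, here is a rough Python sketch. The `step` function is stateless and bounded, like a single model invocation, and an outer loop feeds each output back in as the next input. The binary-increment table and all names (`INCREMENT`, `step`, `run`) are hypothetical illustrations; nothing here shows that GPT-3 can actually encode such a table.

```python
from typing import Dict, Tuple

BLANK = "_"

# (state, symbol read) -> (next state, symbol to write, head move)
Table = Dict[Tuple[str, str], Tuple[str, str, int]]
# (state, head position, tape contents)
Config = Tuple[str, int, Dict[int, str]]

# Hypothetical example machine: add 1 to a binary number.
# The head starts on the least significant (rightmost) bit.
INCREMENT: Table = {
    ("carry", "1"): ("carry", "0", -1),   # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("halt", "1", 0),     # 0 + carry = 1, done
    ("carry", BLANK): ("halt", "1", 0),   # ran off the left edge: new digit
}


def step(table: Table, config: Config) -> Config:
    """One stateless invocation: map a configuration to the next one."""
    state, head, tape = config
    symbol = tape.get(head, BLANK)
    next_state, write, move = table[(state, symbol)]
    new_tape = dict(tape)
    new_tape[head] = write
    return next_state, head + move, new_tape


def run(table: Table, bits: str, max_steps: int = 1000) -> str:
    """Feed each output of `step` back in until the machine halts."""
    tape = {i: ch for i, ch in enumerate(bits)}
    config: Config = ("carry", len(bits) - 1, tape)
    for _ in range(max_steps):
        if config[0] == "halt":
            break
        config = step(table, config)
    final_tape = config[2]
    return "".join(final_tape[i] for i in sorted(final_tape))


print(run(INCREMENT, "1011"))
print(run(INCREMENT, "111"))
```

The unbounded part of the computation lives entirely in the outer loop and the growing tape; each call to `step` does a fixed amount of work.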

It’s pretty obvious that it cannot correctly represent arbitrary transition tables: just construct a Turing Machine that has a larger transition table than can be encoded in the language model.

On the other hand this is just a theoretical argument. Every existing computer is also not a Turing Machine as it has finite memory.

Given that ChatGPT can't correctly answer questions like "What weighs more, a pound of bricks or two pounds of feathers", I can't say I agree.
I was reminded of this sketch as well, quite fitting.
That was a riot. Is the rest of the show this funny?
Generally yes.
A brick of pounds is way heavier
This is more a physics problem than mathematics, no?
It's one unit vs two units. One number is greater than the other.

That's maths, not physics.

I asked ChatGPT to play this game:

I will give you 2 strings A and B.

You must tell me what operations from the list below to transform string A into B. You can use as many operations as you want but the more operations the less points you get.

Insert(a,b) - insert character a at position b. Delete(a) - delete character at position a. Swap(a,b) - swap the character at position a with the character at position b.

A: ello B: Hello

Answer is insert(H, 0)
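For reference, the three operations can be sketched in a few lines of Python. This assumes 0-based positions (which the `insert(H, 0)` answer suggests) and that Swap exchanges the characters at two positions; the function names and argument order are my own.

```python
def insert(s: str, ch: str, pos: int) -> str:
    """Insert character `ch` so that it ends up at index `pos`."""
    return s[:pos] + ch + s[pos:]


def delete(s: str, pos: int) -> str:
    """Delete the character at index `pos`."""
    return s[:pos] + s[pos + 1:]


def swap(s: str, i: int, j: int) -> str:
    """Exchange the characters at indices `i` and `j`."""
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


print(insert("ello", "H", 0))
```

With these definitions, any proposed answer can be replayed step by step and compared against string B.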

Try it for yourself and you will quickly see how bad ChatGPT is, and how simple it is to trick humans into thinking you are intelligent.

For what it's worth, Bing answers correctly and unceremoniously (although I had to reframe the question as a riddle with words, as the word 'string' seems to trigger its anti-code-completion guardrail)

https://imgz.org/i4R5tbXH.jpg

What did it "clear the slate" from?
Some previous conversation, I forget what exactly. When you tap the "new conversation" button it displays a message to confirm context was cleared
I tried this and got a funny answer with the strings "bananas" and "pandas":

Great! Here are some possible ways to transform string A into B using the provided operations:

Insert(p, 0) -> "pananas" Swap(n, 2) -> "pananas" Swap(d, 4) -> "pandnas" Swap(a, 5) -> "pandasn"

Swap(b, 0) -> "abananas" Delete(a, 1) -> "bnanas" Swap(p, 1) -> "bpanas" Swap(d, 3) -> "bpands" Insert(a, 4) -> "bpanda" Swap(s, 5) -> "bpandas"

Insert(p, 0) -> "pananas" Insert(d, 1) -> "pdananas" Swap(a, 2) -> "pdnaanas" Swap(p, 3) -> "pdnapanas" Swap(a, 4) -> "pdnapnaas" Swap(s, 5) -> "pdnapnasa" Delete(a, 6) -> "pdnapnas"
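As a baseline for how short a correct answer can be, here is a sketch using standard Levenshtein distance. Note the swapped-in technique: Levenshtein counts insert, delete and substitute, which is not the same operation set as the game's Insert/Delete/Swap, so treat the number as a rough yardstick only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn `a` into `b` (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("bananas", "pandas"))  # 3
print(levenshtein("ello", "Hello"))      # 1
```

One three-step path is: substitute b with p, substitute the second a with d, delete the following n.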

This is most likely due to how tokenization works, i.e. the model does not even get to see individual letters in most cases, just tokens
Bing also fails on this one:

https://imgz.org/i9ZzrkA3.jpg

wow that is really entertaining lol. It simply can't get this one even close to right.
ChatGPT is now two different models: default ("turbo") and "legacy" (slower and better, from a week ago or whatever). Not specifying which in these types of experimental reports is a big oversight.

You will not see the option unless you buy ChatGPT Plus. I assume the non-plus is "turbo" now.

He's using GPT-3.5, aka text-davinci-003, so not ChatGPT.
He also tested ChatGPT.
I think it's better than most people at arithmetic problems if they have to solve by hand.

Perhaps if the prompt included "double-check your answer", just like math teachers tell students, the correct answer rate would be higher?

It's better at math than I am. I can't get anywhere close to GPT-3's accuracy when multiplying two three-digit numbers, in the same amount of time.
It's different to our ideas of being good at maths. On the one hand, as you mention, it can sometimes resolve large individual calculations, but I have seen it be wrong for very simple word problems or even explicit equations with only a couple of operations, which I myself can immediately catch. It's not a fundamental flaw: I expect it to get better at this as time goes on. But it's another of those fascinating little quirks it has that are so noteworthy only because of how competent it appears to be in other areas.
It is awful at math because it has no understanding of anything.

It can output the correct answer if the correct answer has previously been shown to it, but it may equally just output garbage because it just rngs its answer.

Are you sure about that? Did you even try to ask ChatGPT?

Well I did, and here's what came out:

  Me: Hi ChatGPT. I want to ask you a math question, do you think you can answer it? The question is 328 + 298

  ChatGPT: Hello! Of course, I can help you with that math problem. The sum of 328 and 298 is:

  328

  298

  626

  Therefore, the answer to your question is 626. Let me know if you have any other questions or if there's anything else I can help you with!
  
  Me: That's very good. How did you do that ? 

  ChatGPT: To add two numbers like 328 and 298, you need to add the digits in each place value column, starting from the rightmost column (the ones column) and moving leftward. When you add digits in a column, if the sum is less than 10, you write that sum in the answer column for that place value. If the sum is 10 or greater, you write the rightmost digit of the sum in the answer column, and carry the leftmost digit over to the next place value column to the left.

  So, in this case, we start by adding the ones column: 8+8=16. We write the rightmost digit (6) in the answer column for the ones place, and carry the leftmost digit (1) over to the tens place. Then, we add the digits in the tens column: 1+2+9=12. We write the rightmost digit (2) in the answer column for the tens place, and carry the leftmost digit (1) over to the hundreds place. Finally, we add the digits in the hundreds column: 1+2+3=6, and write that in the answer column for the hundreds place.

  So the final answer is 626.
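The column-by-column procedure ChatGPT describes can be written out as a short Python sketch. The function name `column_add` is hypothetical, and it assumes non-negative integers. It records each column's digits, incoming carry and sum so the steps can be compared with the explanation above.

```python
from typing import List, Tuple


def column_add(a: int, b: int) -> Tuple[int, List[Tuple[int, int, int, int]]]:
    """Schoolbook addition of two non-negative integers.

    Returns the sum and, for each column from the ones place leftward,
    a tuple (digit of a, digit of b, carry in, column total).
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # reversed: ones digit first
    carry = 0
    digits = []
    steps = []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        steps.append((da, db, carry, total))
        digits.append(total % 10)   # rightmost digit stays in this column
        carry = total // 10         # leftmost digit moves one column left
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in reversed(digits)))
    return result, steps


result, steps = column_add(328, 298)
print(result)
for da, db, carry_in, total in steps:
    print(f"{da} + {db} + carry {carry_in} = {total}")
```

For 328 + 298 the column totals come out as 16, 12 and 6, matching the walkthrough in the transcript.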
Ok, everyone, prepare to have your mind blown. I just taught ChatGPT a math party trick. ChatGPT learned it and applied it quickly. Here's the exchange:

  Me: ChatGPT, I want to teach you a math trick. It's a nice one, it's good for nerd parties. You tell people that you can do natural logarithms in your head. Of course, it's going to be with limited precision, one or two exact decimal places. You ask someone for a random number, let's say up to 1000 and with 2 decimal places (the decimal places are irrelevant, but they make the trick more impressive). Let's say they pick 728.12. You get the natural logarithm in two easy steps: the first one is called range reduction, and the second one Taylor series. The range reduction step has the goal to reduce the number to one slightly above 1. You keep a short table of logarithms in your memory. The one I use is log(2) = 0.7, log(10) = 2.3, from which it follows that log(5) = log(10)-log(2) = 1.6, and log(3) = 1.1. If you want to get fancy, you can remember log(7) = 1.95, but that's optional. So, we start with 728.12. We divide by 100 (which has a log of 4.6) and get 7.2812. We then divide by 6 (which is 2 times 3, so it has a log equal to 0.7 + 1.1 = 1.8). 7.28 divided by 6 is about 1.21. We don't need more precision than that. We are done with the first step. The second step is Taylor series. We use log(1+x) = x -x^2/2 for small x. In our case log(1.21) is roughly 0.21 -0.21^2/2. 0.21^2 is about the same as 0.2^2 which is 0.04. We divide by 2 and get 0.02. So log(1.21) is about 0.19. Now we add back the logs that we used for range reduction (4.6 and 1.8) and get 6.4 + 0.19 = 6.59. This is very, very close to the exact logarithm, which is 6.5904. Do you think you understand this trick ? 

  ChatGPT: Yes, I understand the trick you described. It's a clever way to estimate natural logarithms in your head with limited precision. The range reduction step involves using logarithmic identities to reduce the given number to a value that is slightly above 1, and then using the Taylor series expansion of the natural logarithm around 1 to estimate the logarithm of the given number. The result is then adjusted by adding back the logarithms used in the range reduction step. Overall, it's an impressive party trick that demonstrates an understanding of logarithmic identities and Taylor series expansion.

  Me:  Ok. Can you apply this to estimate log(234.56) ? And then walk me through the steps 

  ChatGPT: Sure, I can walk you through the steps to estimate the natural logarithm of 234.56 using the trick you described earlier. Here are the steps:

  Step 1: Range Reduction

  We start by dividing 234.56 by 100 (which has a log of 4.6) to get 2.3456. We then divide by 2 (which has a log of 0.7) to get 1.1728.

  Step 2: Taylor Series

  Next, we use the Taylor series expansion of the natural logarithm around 1 to estimate the logarithm of 1.1728:
  log(1.1728) = (1.1728 - 1) - (1.1728 - 1)^2/2 = 0.1728 - 0.0276 = 0.1452

  Step 3: Adding back logarithms

  Finally, we add back the logarithms used in the range reduction step to get the final estimate:

  log(234.56) ≈ log(100) + log(2) + log(1.1728) = 4.6 + 0.7 + 0.1452
    ≈ 5.4452

   Therefore, using this trick, we can estimate the natural logarithm of 234.56 to be approximately 5.4452, which is close to the exact value of 5.4553.
My immediate question would be whether this algorithm occurs in the training data; I'm fairly certain it does. If it doesn't then it's quite interesting it's able to 'parse' the steps from the description without losing its train of 'thought'. Even if it does, learning the algorithm from this description and applying it correctly is interesting...
> learning the algorithm from this description and applying it correctly is interesting...

I find it even more interesting that ChatGPT made mistakes, but it made them in an almost human-like way: it said

  (1.1728 - 1)^2/2 = 0.0276 
when in reality it is 0.0149. The mistake is not big though (about 0.012) so the final result is not impacted by much.

Finally, the last statement ("the exact value of 5.4553") is simply bull. The exact value is 5.4577.
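The numbers are easy to check with a small Python sketch of the trick. The rounded log table is the one from the original description; `estimate_log` and its `factors` argument are hypothetical names, and the caller has to pick factors that bring the number slightly above 1.

```python
import math
from typing import Iterable

# Rounded natural logs memorised for the trick, as given in the description.
LOG_TABLE = {2: 0.7, 3: 1.1, 5: 1.6, 10: 2.3}


def estimate_log(x: float, factors: Iterable[int]) -> float:
    """Estimate ln(x) by range reduction plus a two-term Taylor series.

    `factors` lists the numbers (keys of LOG_TABLE) to divide x by so that
    the reduced value ends up slightly above 1.
    """
    reduced = x
    offset = 0.0
    for f in factors:
        reduced /= f            # range reduction
        offset += LOG_TABLE[f]  # remember the log we divided out
    t = reduced - 1.0
    return offset + t - t * t / 2  # log(1 + t) ~ t - t^2/2 for small t


# ChatGPT's example: divide by 100 (= 10 * 10), then by 2.
estimate = estimate_log(234.56, [10, 10, 2])
second_order_term = 0.1728 ** 2 / 2

print(f"second-order term: {second_order_term:.4f}")
print(f"estimate:          {estimate:.4f}")
print(f"math.log:          {math.log(234.56):.4f}")
```

Done correctly, the trick lands well within 0.01 of `math.log(234.56)`, closer than the value ChatGPT reported.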

Yeah, but I didn't expect perfection. Simply getting this close is very interesting imo, even if it does not know the exact value. More parameters might have an even better effect.
Nice! Amazing. Also I don't feel as bad about not being invited to parties now.
GPT-3 is much worse at math in comparison to BLOOM. GPT-3 honestly sux at math, as it should.