| HN Mirror



	by spupe 1596 days ago
	Thank you for this. Technically it's not GPT-3, but GPT-NeoX-20B, although they are based on a similar architecture. The poor performance is most likely due to not having a large database of math problems to draw from. GitHub, for example, is part of the dataset that is used to train both GPT-3 and GPT-Neo variants, which is partly why they can generate meaningful code (sometimes). I wonder how a model finetuned for math would perform.

6 comments

dang 1596 days ago

Ok, we've reverted the title now. Thanks!

(Submitted title was 'GPT-3's answers to arithmetic questions')


williamtrask 1596 days ago

Poor performance is more likely due to how transformer neural networks view numbers. They memorise numbers like words instead of modeling their numerical structure. Thus even if a model has seen the numbers 3456 and 3458, it knows nothing of 3457. Totally different embedding.

It’s like a kid memorising a multiplication table instead of learning the more general principle of multiplication (related: this illusion is why big models are so popular. Memorise more stuff.)

Paper (NeurIPS/DeepMind): https://arxiv.org/abs/1808.00508
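To make the "totally different embedding" point concrete, here is a minimal sketch. It assumes a toy word-level vocabulary in which every distinct string gets an arbitrary integer ID; `build_vocab` and `lookup` are hypothetical helpers for illustration, and real tokenizers (which usually split numbers into sub-pieces) behave differently.

```python
# Toy word-level vocabulary: each distinct string gets an arbitrary integer ID.
# Assumption for illustration only; this is not GPT-NeoX's actual tokenizer.
def build_vocab(corpus_tokens):
    vocab = {}
    for tok in corpus_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

seen = ["the", "number", "3456", "and", "3458"]
vocab = build_vocab(seen)

def lookup(token):
    # None means the token has no entry (and so no embedding row) at all.
    return vocab.get(token)

# The IDs for 3456 and 3458 carry no numerical relationship,
# and 3457 is simply absent.
print(lookup("3456"), lookup("3458"), lookup("3457"))
```

Under this assumption, nothing about the IDs says that 3457 sits between the other two; any such structure would have to be learned from context.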


Isinlor 1596 days ago

Take a look at this paper:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

If you look at embedding visualization it is very clear that the model learns order of numbers.

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

There is also:

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Again, looking at visualizations the model very clearly grasps the structure of the function it models.


pfortuny 1596 days ago

Modulo 97 (the arxiv paper). That is what they do.

It is quite easy to grok operations modulo 97.
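One way to read this point: the whole problem space modulo 97 is a small finite table. A rough sketch, assuming addition as the example operation (the name `table` is just for illustration):

```python
P = 97  # the modulus discussed in this thread

# Every possible addition problem modulo P, as (a, b, result) triples.
table = [(a, b, (a + b) % P) for a in range(P) for b in range(P)]

# The entire task is P * P problems, and every answer lies in 0..P-1.
print(len(table))
print(max(result for _, _, result in table))
```

With only 97 * 97 problems in total, a network can see a large fraction of all possible cases during training, which is not true of arithmetic over unbounded integers.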


YeGoblynQueenne 1596 days ago

The "Deep Symbolic Regression" paper reports very poor generalisation results that break off after a small n (where n is the number of tokens in the predicted sequence). It works some of the time for n = 1 (predicts the next token) but accuracy drops off for n = 10. No results are reported for n > 10 as far as I can tell in the "Out of Domain Generalization" section (which is the meat and potatoes of the "generalization" claim).

tl;dr they can sometimes generalise to the next 1 to 10 tokens (digits or operators), but no more.

This kind of short-term "generalisation" on OOD data is standard in neural nets trying to approximate symbolic regressions or things like grammars etc as far as I know.

I do like that they use "Out of Domain" rather than "Out of Distribution" as a target though. That makes more sense.


Isinlor 1596 days ago

I don't think you will find any human who can extrapolate a sequence generated with more than 10 operators. And longer input sequences are actually easier to handle (see Fig. 1, the rightmost graph).

If you think you can do better than their program then:

Seq1: [0, 1, 2, 3, 6, 7, 13, 26, 32, 58, 116, 142, 258, 516]

Seq2: [2, 2, 3, 5, 10, 12, 22, 44, 54, 98, 196, 240, 436, 872]

Seq3: [3, 1, 8, 9, 18, 19, 37, 74, 92, 166, 332, 406, 738, 1476]

Their program is able to guess the correct continuation with one more sequence element.

SHA1 hash for verification: bef5e213340f91258b3b9a0042c9c083dd91cb80


YeGoblynQueenne 1596 days ago

I don't think I understand what you mean. Aren't all the sequences on the Online Encyclopedia of Integer Sequences created by humans? We clearly have the tools to extrapolate sequences from examples, rather than just eyeballing them and trying to guess them. For instance: we have maths. So I must have misunderstood your meaning?


Isinlor 1596 days ago

If you look at the 3 sequences I gave you, can you guess the following elements of each sequence?

We can create sequences, but guessing underlying patterns is a lot more difficult.

Humans will have a very hard time if you go beyond around 10 operators in a pattern used to generate a sequence.

My guess is that their model will be better at it than me or you.


nicholast 1596 days ago

The cool thing about math applications is just how easy it would be to generate synthetic data. That these large language models haven't attempted to supplement their gigabytes+ scale data sets with such is an oversight.
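As a rough illustration of how cheap such data is to produce, here is a minimal sketch of a generator. The prompt format, the operator set, and the helper name `make_example` are assumptions for the example, not anything used by the models discussed here.

```python
import random

def make_example(rng, max_value=10**6):
    """Return one (prompt, answer) pair for a random arithmetic problem."""
    a = rng.randint(0, max_value)
    b = rng.randint(0, max_value)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"{a} {op} {b} =", str(answer)

rng = random.Random(0)  # seeded so the run is repeatable
dataset = [make_example(rng) for _ in range(5)]
for prompt, answer in dataset:
    print(prompt, answer)
```

Because the answers are computed rather than scraped, every example is correct by construction and the supply is effectively unlimited.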


williamtrask 1596 days ago

Or you could just use a 50-cent calculator.

Note, you’d need to train such a model on data teaching it about the relationship of every number to every other number when run through every function. Yes, infinite synthetic data, but you’re just memorising stuff you can already generate.


not2b 1596 days ago

Or build a model that has "peripherals". Oh, I'm being asked to do math. Let's put it in my calculator app. Everything doesn't have to be in one uniform network.

Evidently the brain works that way: the cortex is built on top of older components, so it doesn't have to figure out basic metabolism the same way it has to learn to identify people.
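The "peripherals" idea could be sketched roughly as below: a router sends anything that looks like plain arithmetic to a calculator and everything else to the model. All names here (`calculate`, `answer`, `ARITHMETIC`, `fake_model`) are hypothetical, and `fake_model` is only a stand-in for a real language model.

```python
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expr):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr.strip(), mode="eval").body)

# Crude test for "this is just arithmetic": digits, whitespace, operators, parentheses.
ARITHMETIC = re.compile(r"^[\d\s+\-*/().]+$")

def answer(prompt, language_model):
    if ARITHMETIC.match(prompt):
        return str(calculate(prompt))   # the "calculator app"
    return language_model(prompt)       # everything else goes to the network

fake_model = lambda prompt: "(free-text answer from the model)"
print(answer("12345 * 87654", fake_model))
print(answer("Why is the sky blue?", fake_model))
```

This is only a sketch: malformed input such as "1 +" would raise an error, and a real system would need the model itself to decide when to call the tool.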


plutonorm 1596 days ago

It's recently been shown that even though the numbers are represented with different tokens, the network learns to form an internal representation that understands the progression from one token to the next.


nikolayasdf123 1596 days ago

The idea that each number has to be inside one's brain or neural network or token is plainly wrong.

The network has to grasp the "abstract" number, but it clearly did not grasp that concept.


Isinlor 1596 days ago

How would you test if it grasped the concept?


plutonorm 1596 days ago

https://arxiv.org/pdf/2201.02177.pdf

This paper shows fairly conclusively that the network 'groks' modular addition.


pfortuny 1596 days ago

Modulo 97.

This is what it is. Not "general arithmetic".


catach 1596 days ago

Being able to extrapolate to numbers that were not in the training set, perhaps? At least that'd be a basic part of the requirement.
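A test along those lines might look like the sketch below. It assumes a "model" is just a callable taking two operands; `held_out_accuracy` and `lookup_model` are hypothetical names, and the lookup table is a stand-in for pure memorisation.

```python
def held_out_accuracy(model, train_operands, candidate_operands):
    """Accuracy on addition problems whose operands never appeared in training."""
    seen = set(train_operands)
    unseen = [n for n in candidate_operands if n not in seen]
    problems = [(a, b) for a in unseen for b in unseen]
    correct = sum(model(a, b) == a + b for a, b in problems)
    return correct / len(problems)

# Stand-in "model" that only memorised a table of training pairs.
train = range(0, 100)
memorised = {(a, b): a + b for a in train for b in train}
lookup_model = lambda a, b: memorised.get((a, b))

# Operands 100..109 were never seen, so pure memorisation scores zero.
print(held_out_accuracy(lookup_model, train, range(90, 110)))
```

A model that had learned addition itself, rather than the table, would score 1.0 on the same check.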


Isinlor 1596 days ago

Sure:

Deep Symbolic Regression for Recurrent Sequences https://arxiv.org/abs/2201.04600

(Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com... )

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

Both of these models can generalize to numbers they have not seen.


eutectic 1596 days ago

That depends on the tokenization scheme.


moffkalast 1595 days ago

> this illusion is why big models are so popular. Memorise more stuff

It's all just a compressed lookup table that can handle in-betweens.


spupe 1596 days ago

I went and checked; it turns out that for this version EleutherAI has in fact included math problems [1]. So my earlier comment is partly incorrect.

[1] http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf


asah 1596 days ago

And isn't it trivial to generate lots of correct sample data? :-)


throwaway4good 1596 days ago

No. The poor performance comes from the overall approach of using neural nets to solve basic math problems.


FL410 1596 days ago

The cool part comes when the model can make the connection that

multiply 12345 by 87654

is the same as

    def multiply_two_numbers(x, y):
        return x * y

Which of course produces the desired result. The interesting part is that GitHub Copilot wrote the above given only "def multiply_two" as the prompt.
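For completeness, a short runnable version of that function applied to the numbers from the example prompt (the call at the end is an added usage line, not part of what Copilot generated):

```python
def multiply_two_numbers(x, y):
    # Plain integer multiplication; Python integers do not overflow.
    return x * y

print(multiply_two_numbers(12345, 87654))  # 1082088630
```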


andreyk 1596 days ago

Oh, that's a pretty big difference, would be nice if post title was altered...
