Hacker News new | ask | show | jobs
by moyix 1596 days ago
Hey! As the author of the gist, just wanted to clear up what seem to be a few misconceptions:

- This isn't GPT-3, it's the recently-released open-source and open-weights model from EleutherAI, GPT-NeoX-20B. GPT-3 is much larger (175 billion parameters vs NeoX's 20 billion).

- It's well-known that language models don't tend to be good at math by default (Gwern, among others, pointed this out back in June 2020). It seems likely that this is at least in part because of how these models currently tokenize their input (they don't represent numbers by their individual digits, but by tokens representing commonly-occurring character sequences): https://www.gwern.net/GPT-3#bpes . Someone also pointed me to this paper which looks at number representations (though it uses somewhat older models like BERT): https://arxiv.org/abs/1909.07940

- Despite the tokenization, it performs (IMO) surprisingly well at getting close to the true value, particularly for the start and end digits and the overall magnitude. You can see this by looking at the tokenization (indicated by brackets) of its guess vs the correct answer for 28531*8065 (I asked multiple times to get an idea of how consistent it is – it's not deterministic because I ran this with temperature = 0.1, which will use random sampling to get the most likely tokens):

  [What][ is][ 285][31][ *][ 80][65][?][\n][22][77][05][315]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][95][01][115]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][38][95][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][25][015]
                              Correct: [\n][23][010][25][15]
  [What][ is][ 285][31][ *][ 80][65][?][\n][22][99][17][115]
                              Correct: [\n][23][010][25][15]
You can see that it manages to find things that are numerically close, even when no individual token is actually correct. And it compensates for different-length tokens, always picking tokens that end up with the correct total number of digits.

- Please don't use this as a calculator :) The goal in doing this was to figure out what it knows about arithmetic and see if I can understand what algorithms it might have invented for doing arithmetic, not to show that it's good or bad at math (we have calculators for that, they work fine).

6 comments

The "algorithm" isn't too mysterious, especially in light of your observation that it does better at the beginning and end digits. It's just doing what transformers do: predicting the probability of a token given the tokens it can attend to. Assume 20B parameters is enough to memorize an addition table. Then the first digit or two is relatively predictable, as are the last, and as is the length, aka the probability of a space token. The middle tokens are less predictable. This is consistent with the result.

Furthermore, it doesn't even really need to memorize the addition table in the explicit way this suggests. Think about the probability of certain digit tokens appearing given the presence of numbers and plus signs in its data. Thus a behavior consistent with having memorized an addition table emerges from mimicking its training data.

It's a little bit more complex here because tokens are variable-length. So getting the order of magnitude (i.e. number of digits) correct requires that it be able to pick tokens for the beginning and end that have the right start/end digit, and then figure out how to make the middle the right length.

And sure, of course it emerged from mimicking (or more precisely, learning to predict the most likely next token in) its training data – that's how it was trained, it can't have emerged from anything else :) But that doesn't tell us what the higher-level algorithm represented by the weights of the network is. I'm talking about things like this for understanding an algorithm for curve detection learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/

Would it do better if you asked it to "show its work"? I.e. work it out in long form, one step at a time, like you'd ask a school kid to do. Maybe an example prompt would look like this:

    Work out 2241 + 19873.
    02241 + 19873 ~ ____4
    02241 + 19873 ~ ___14 carry 1
    02241 + 19873 ~ __114 carry 1
    02241 + 19873 ~ _2114 carry 1
    02241 + 19873 = 22114.
I'm not sure what is the best way to represent each step including details like carry digits. And you'd have to design a separate scheme for each operation.

If these models are symbol manipulators maybe the key is to break down the task into steps that are closer to being solvable with symbol manipulation.

I tried something like that for 3-digit multiplication with GPT-3 in another comment[1], successfully. You have to lay things out different manner than you did here, because GPT-*s have no sense of layout on a page; their byte-pair encoding destroys their ability to learn it efficiently. Further, transformers are optimized to look for things via similarity, because that's what typically occurs in text, so you're better off writing out things it can anchor off of.

There are ways to fix these issues, but BPEs micro-optimize for the primary text benchmarks that papers want good scores on so those are standard for now. I'm sure they'll get replaced eventually, once the costs outrun the wins and more scalable (alternatives to?) transformers become popular.

[1] https://news.ycombinator.com/item?id=30309302

I get the tokenization argument and it may influence it a bit, but I suspect the n-digit math issue has to do more with search the way it samples (in the bpe link gwern references some experiements I'd done with improving n-digit math by chunking using commas, http://gptprompts.wikidot.com/logic:math). I think since it samples left to right on the first pass, it's not able to predict well if things carry from right to left.

I think can mitigate the search issue a bit if you have the prompt double-check itself after the fact (e.g. https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-ha...). Works different depending on the size of the model tho.

Yup, quite possible that this has something to do with it. There is other work showing that giving LMs a "scratchpad" for intermediate computations allows them to do much better not just at arithmetic but also things like predicting the output of some code: https://arxiv.org/abs/2112.00114
definitely. also works on text translation/comprehension like emojis! https://aidungeon.medium.com/introducing-ai-dungeon-translat.... For actual benchmarks, scratchpad improves GPT-Davinci WIC from 50% accuracy (chance) to nearly 70%.

I think the check and validate is a different sort of scratchpad but maybe not. Seems like at least 3 types - soe for pulling implicit info out of the network viz wic, sometimes for intermediary steps viz coding, sometimes for verification like here.

The big caveat here is that the inner monologue papers generally work with GPT-3-175b, LaMDA, or Gopher, all of which are much bigger than 20b, and they generally show phase transitions (https://old.reddit.com/r/mlscaling/comments/sjzvl0/d_instanc...) in the monologue capability: below a critical size, inner monologue doesn't work at all, performing worse than baseline even, no matter how they scale, and only past the critical size does inner monologue suddenly start working much better. So it's possible (has anyone checked?) that GPT-NeoX-20b just isn't large enough to do inner monologue.
yeah, that's a very big caveat - haven't checked neo 20b yet. I've had a hard time getting the AI21 models to use it and those are also pretty big so it's interesting why sometimes it works and sometimes it doesn't. (and Davinci > Codegen Davinci > Curie > J-6B). Fine tunes can also learn to do the inner monologue as well which is really cool - not sure how much is architecture vs. training parameters.
I feel like attention would largely mitigate that, no? Has anyone looked at what the weights are while doing addition?
How can a language model invent algorithms for arithmetic? How would an algorithm be represented in a language model? Isn't that the first thing to ask, before starting to look for algorithms?

For example, if I take a stroll on the beach, am I likely to see any algorithms coalescing in the grains of sand?

Invent is probably the wrong word since it implies agency, sure. Maybe "discover" or "luck upon", since whatever it's doing was formed by updating a pile of floating point weights with gradient descent?

I think it certainly makes sense to ask what the higher level "algorithm" at work here is, though. Electrons flow through wires and transistors in (say) an adder [1]; looking at the wires and transistors you won't see an algorithm for addition, but there is certainly one present, codified in the arrangement of those wires and transistors. But maybe we can reverse engineer whatever the LM is doing by a combination of probing it with experiments like these and (maybe) inspecting the learned weights. The Curve Circuits paper did this for reverse engineering a curve detector learned by a convolutional neural network: https://distill.pub/2020/circuits/curve-circuits/

I also don't mean to imply that it's a good algorithm, or one that generalizes to arbitrary numbers, etc. Maybe it's just (effectively) a lookup table and some special cases!

[1] Please don't yell at me for this metaphor, I bailed out of physics after scraping out a B- in E&M ;)

Don't worry, I won't yell at you :)

I'm fine with "invent" actually, despite the implication of agency (I'm used to the terminology "predicate invention" [1]; although maybe I should actually re-examine the motivation behind it).

I'm more interested in the representation issue. I had a look at the quoted article on CNNs earlier. I think there is a very fine line between claiming that a CNN's weights represent an algorithm and that its weights can be _interpreted_ as an algorithm. I feel that the article leans too heavily on the interpretation side and doesn't make enough of an effort to show that the CNNs weight really represent an algorithm, rather than having activations in subsequent layers and therefore with a natural ordering.

In any case, I would like to understand how a language model can represent an algorithm.

_____________

[1] https://link.springer.com/referenceworkentry/10.1007/978-0-3...

> I think there is a very fine line between claiming that a CNN's weights represent an algorithm and that its weights can be _interpreted_ as an algorithm.

Yeah, I agree this is an issue. It feels a bit reminiscent of Searle's Waterfall argument, and so I'm inclined to turn to Scott Aaronson's response here [1; Section 6] – basically, how much work is the interpretation itself doing? If you actually tried to use the "algorithm" to do what your interpretation says it should do, how much work would you have to put into that mapping? If the work required amounts to just implementing the algorithm and effectively ignoring the CNN (or waterfall), then the interpretation is what was doing all the work.

IMO the Curve Circuits case passes this test, since they show that you can mechanically take out the learned weights, drop in the weights derived algorithm they reverse engineered, and the model still works about as well.

> In any case, I would like to understand how a language model can represent an algorithm.

Likewise! :)

[1] https://www.scottaaronson.com/papers/philos.pdf

Thanks for the link to Scott Aaronson's paper, that I hadn't read. I think your comment helped clarify something that bothered me with the CNN paper: I would feel more convinced if the claimed algorithm had been implemented _without_ a neural net. What the authors did was manually set the weights of another neural net. If an algorithm is discerned in the neural net's weights, then why can't it be programmed in the usual manner, in some computer language? If it can't, then that's very difficult for me to accept it as an algorithm, because I can't see what it is, exactly. The authors claim that the algorithm can be described in a few English sentences, but in truth these few English sentences are a high-level description of the behaviour of the CNN rather than a sequence of instructions (which is what I have in mind, perhaps erroneously, as an "algorithm").

I'm not necessarily asking for simplicity. I'm used to algorithms being relatively simple things that can be implemented in a few lines of code and I'd think of something more extensive as a "program" rather than an algorithm, but I appreciate that an algorithm encoded in the weights of a deep neural net could be something really big. I just want to see this algorithm written down in pseudocode at least, in a form that can be executed by a device other than a neural network (like me, following the pseudocode). I think that is the opposite of Aaronson's point actually.

I think I might be a realist:

https://youtu.be/STFcvzoxVw4?t=97

I mean it's just a generic blob of machine learning. It learned to do math because we write about math. It can't do arbitrarily long computations since it's only so recursive, but getting the basics down isn't too hard. It probably only struggles because it can't see digits, and can't do digit-wise operations (same with rhyming actually!).
What I did was train GPT-3 that I was asking a math question and then have it run some JavaScript to do the math with the text it thought was math formulas to get the answer.
Yeah it's interesting how moving up a level of abstraction works well here! "Write me a function that multiplies two numbers" works much better than trying to get it to multiply the numbers themselves. There's a recent-ish paper exploring this:

https://arxiv.org/abs/2108.07732

i'd first ask it all single digit arithmetic, then two digit with and without carry and then go from there. longer strings are going to be confusing unless you're looking for pieces of them in the training data, methinks.

i suspect you could probably train a GAN to do binary or base 10 arithmetic, but have never tried or searched for papers.