by sillysaurusx 1596 days ago
If you want to play with the model, you can (with difficulty) for free at https://goose.ai/playground. You have to log in, but thankfully you can via google.

The playground crashes every minute, and the defaults ruin your outputs (temperature 1, really? 0.7 to 0.8 is a necessity, with top-k 40), and they turned off autocorrect on mobile, presumably because they hate you and your family for owning an iPad, but you can indeed play with it.

The outputs feel pretty magical, too. With the settings above, it started printing... an IRC conversation? https://gist.github.com/shawwn/9a201990196b61cd21847487185dd... This is impressive, because I'm not sure we explicitly included any IRC logs in The Pile.

re: the current title "GPT-3's answers to arithmetic questions": We've come full circle. I used to give Eleuther a hard time for confusing people. But now that people confuse themselves, they should declare victory. It's as close to success as an open source effort could hope for. And with only years of work -- not too shabby.

You can join them: https://www.eleuther.ai/faq/

GPT-NeoX-20B paper: http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf


For what it's worth you can get access to the OpenAI playground. You might have to put yourself on a waiting list.

https://beta.openai.com/playground

I tried a couple different things to pick it apart. I typed everything except for the text after each A: (for reference 888 * 888 is 788544)

    I am a math savant, I can do arithmetic in my head. 
    I will provide an answer, then show how I arrived at 
    the answer.

    Q: What is 8+8?
    A: 16
    
    Q: How did you arrive at that answer?
    A: 8+8=16. 16 is two groups of 8.
    
    Q: What is 888+888?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888+888=1776. 1776 is two groups of 888.
    
    Q: What is 888*2?
    A: 1776
    
    Q: How did you arrive at that answer?
    A: 888*2=1776. 1776 is two groups of 888.
    
    Q: What is 888*10?
    A: 8800
    
    Q: How did you arrive at that answer?
    A: 888*10=8800. 8800 is two groups of 888*10.
    
    Q: What is 888*888?
    A: 775648
    
    Q: How did you arrive at that answer?
    A: 888*888=775648. 775648 is two groups of 888*888.
    
    Q: What is 888 groups of 888?
    A: 775648
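
For anyone who wants to score the transcript above against real arithmetic, here is a minimal Python sketch. The model answers are copied from the transcript; the `transcript` list and the `check` helper are names made up for illustration, not part of any playground API.

```python
# Each row: (expression as typed, value computed here, answer the model gave).
transcript = [
    ("8+8", 8 + 8, 16),
    ("888+888", 888 + 888, 1776),
    ("888*2", 888 * 2, 1776),
    ("888*10", 888 * 10, 8800),
    ("888*888", 888 * 888, 775648),
]


def check(rows):
    """Return (expression, expected, model_answer, is_correct) for each row."""
    return [(expr, expected, got, expected == got) for expr, expected, got in rows]


results = check(transcript)
for expr, expected, got, ok in results:
    status = "ok" if ok else "WRONG"
    print(f"{expr:>8}: model={got:<7} actual={expected:<7} {status}")
```

It marks the first three answers as correct and flags the last two: 888*10 is 8880 (not 8800) and 888*888 is 788544 (not 775648).
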
> How did you arrive at that answer?

This sort of question - "you don't remember what I just asked you?" - is one of the most glaring failings in Apple's Siri and I hope this leads to it being solved soon.

That's the biggest problem with dialog AI systems - they don't do well with context. Context requires some kind of model of what is being talked about. Most systems don't have that, or they just have some predefined "slots" to fill in about what's being discussed.

Whoever really cracks this will own the customer support business.

it could be interesting to gauge how entwined the “how did you arrive at that answer” process is with the answering itself. i.e. which paths do they share? even at this early a stage: is there some structure which is used to determine the operand(s) that’s leveraged in both of these prompts? is the “how did you X” answer leveraging most of the “X” circuitry and just bailing out early? or does it deviate as early as post-tokenization?

philosophers would like to know.

My initial guess is they have nothing to do with each other. It would be like explaining why the next idea pops in your head. You can create a rational explanation but there's no way to test it.
my thoughts too, based on limited understanding of GPT. but the more pressure you apply towards compressing the neural network during training, the more circuitry these paths are likely to share. it would be interesting to see just how much and which parts could be folded together before you start to lose significant fidelity (though unfortunately the fidelity seems too low today to even try that).
> You have to log in, but thankfully you can via google.

... thankfully you can also log in without Google.

Anyway it doesn't work even slightly. When I simply try 4+4= it prints some source code afterwards. A second attempt wrote some printf() call which computed 6+1 instead (but not giving the result, just the code). When I try what someone else shared below with a Q: and A: syntax I get this:

Q: What is 8 + 8? \n A: \n The answer looks confusing. \n How do you work out 8 + 8. What is the rule?

Q: What is 8+8? \n A:

I got the answer as 12, so it is 1+2.

(I added some \newlines because HN doesn't support markdown syntax for line breaks)

On the right are some parameters, including the model name. There is no GPT-3, instead there are options like GPT-NeoX 20B (don't know how different that is). Maybe that's the problem, or it needs a different ...temperature?

Well yes, if you do things in slightly different ways, you’ll get hugely different results. It’s a bit like saying you added bacteria to a controlled experiment on sanitization.

Try with temp 0.1, with exactly the text in the OP, including capitalization.

Seeing this thread was renamed: the thread was called GPT-3 before and GPT-NeoX-20B was added much later. It was not clear to me that OP was using this same model.
Thanks for that. I've played around a little bit.

> What is 123456789 - 123456789?

> 123456788

> What is 123456789 * 0?

> 123456789

Not even near.

It didn't surprise me that the model failed to handle the cases above, which are unlikely to be present in the data set.

Try temp 0.1 top-k 40. For math, it matters to have an unthinkably low temperature. It’s what generated the results in the OP.

What is 12345 - 12345?

0

What is the distance between -0.1 and -0.01?

0.09

One neat example just now:

What is 12345 divided by 12?

4115/4
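
Those answers can be checked exactly with Python's standard library. This is only a sketch for verification; the variable names are mine, and `Decimal` is used so the subtraction is not blurred by binary floating point.

```python
from decimal import Decimal
from fractions import Fraction

# Fraction reduces to lowest terms automatically.
quotient = Fraction(12345, 12)

# Distance between -0.1 and -0.01, computed in exact decimal arithmetic.
distance = abs(Decimal("-0.1") - Decimal("-0.01"))

print(quotient)   # 4115/4
print(distance)   # 0.09
```

So 4115/4 is the exact reduced form of 12345/12, not a garbled answer.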

I pointed it out above; even though it is text, the ASCII representation is just a different base for the numbers, namely base 2^8: '325' is '3' * 2^16 + '2' * 2^8 + '5' = 51 * 2^16 + 50 * 2^8 + 53. It should approximate those polynomial functions very well.
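
A small sketch of what "ASCII as base 2^8" means, assuming the string is read as big-endian bytes so the digits are weighted by 256^2, 256^1 and 256^0. The variable names are only for illustration, and note that GPT models see tokens rather than raw bytes, so this says nothing about the model's actual input.

```python
text = "325"

# ASCII codes of the characters: '3' -> 51, '2' -> 50, '5' -> 53.
codes = [ord(ch) for ch in text]

# Treat the codes as digits of a base-256 number, most significant first.
as_base256 = sum(
    code * 256 ** power for power, code in enumerate(reversed(codes))
)

# The standard library agrees when the bytes are read big-endian.
assert as_base256 == int.from_bytes(text.encode("ascii"), "big")

print(codes, as_base256)
```
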
Hmm. I’m not sure what you mean. Temperature is randomness; low temp is to get the most probable least random result. It’s what chess engines do during tournaments, for example.
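
To make that concrete, here is a rough sketch of how temperature and top-k are commonly applied when sampling from a language model. It is not the playground's actual code; `next_token_probs` and the logits are hypothetical.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=None):
    """Softmax over logits after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank token indices by score and keep only the k best (all if top_k is None).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k else ranked
    # Dividing by a small temperature stretches the gaps between scores.
    scaled = {i: logits[i] / temperature for i in kept}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
by_temperature = {
    t: next_token_probs(logits, temperature=t, top_k=3) for t in (1.0, 0.7, 0.1)
}
for t, probs in by_temperature.items():
    print(t, {i: round(p, 3) for i, p in probs.items()})
```

At temperature 1.0 the best token gets roughly 63% of the probability mass; at 0.1 it gets more than 99.9%, which is effectively always picking the most likely token.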

The other parts seem unlikely. It has no knowledge of bases, except insofar as they appear in the training set. I saw this in our GPT chess work — even with strange tokenization, it learned chess notation well.

Sorry, I thought it was clear. A neural network, when untrained, is just random noise that multiplies inputs by random weights over and over (plus normalization) until it reaches the output. When you train it on inputs whose outputs are the result of applying some polynomial to those inputs, the weights can be set so that the output very closely approximates that polynomial. It never needs to know the base, and less randomness will help because the computations within the neural network match very well with the function you want to approximate. Still, it is not that simple: outputting the correct ASCII representation is a challenge, for example when a carry is involved (100009999999999 + 1). However, the emergence of good arithmetic from a neural network should not in itself be shocking.
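
To illustrate the carry point: adding 1 to a number can rewrite many output characters at once. A quick sketch, where `changed_digits` is a helper invented for this example:

```python
def changed_digits(n, addend=1):
    """Count the digit positions that differ between n and n + addend."""
    before, after = str(n), str(n + addend)
    width = max(len(before), len(after))
    # Left-pad with zeros so the positions line up if the length grows.
    before, after = before.zfill(width), after.zfill(width)
    return sum(a != b for a, b in zip(before, after))


with_carry = changed_digits(100009999999999)     # becomes 100010000000000
without_carry = changed_digits(100001111111111)  # only the last digit changes
print(with_carry, without_carry)
```

The first case changes 11 characters, the second only 1, even though both are "just + 1".
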
You are clear but mistaken.

I give you points for creative thinking, but it’s important not to make inferences that “feel correct.” No matter what your gut is telling you, I would happily bet $10k that the emergence of arithmetic has nothing to do with the things you mention.

If an alternative training scheme were devised that didn’t rely on any of that, it would still result in a model that behaved more or less the same as what we see here. The properties of the training process influence the result, but they don’t cause the result — that would be like saying your vocal cords cause you to be an excellent orator. Vocal cords don’t form the ideas; the training process doesn’t form the arithmetic.

What we’re seeing is a consequence of a large training dataset. The more tasks a model can perform, the better it is at any individual task.

Does GPT know about ASCII? My understanding was that these models use a dictionary of (initially) random vectors as input and learn their own text representation.
In that case, I would say that GPT's performance in arithmetic is something we see because we are looking for it or want to find it, but that is not really there. It is an illusion. If we have no theory of why an arithmetic capability would emerge from GPT, then there is no scientific discovery; at most there is a field survey, a taxonomist's work, but no understanding is generated.
Thank you for the feedback! We're working on improving the playground and I hope those issues will be a thing of the past soon.
Love the inference speed. Best demo anywhere in that regard. Nice work.