by packet_nerd 943 days ago
Here's a question to ChatGPT I just made up:

>> A magical frog was counting unicorns. He saw 5 purple unicorns, 2 green unicorns, and 7 pink unicorns. However, he made a mistake and didn't see 2 unicorns: one purple and one green. Also, since he was a magical frog, he didn't see unicorns that were the same color as himself. How many unicorns did he count?

It correctly answers 11 for me.

To me this has demonstrated:

* "Understanding": It understood that "didn't see" implies he didn't count.

* "Knowledge": It knew enough about the world to know that frogs are often green.

* "Reasoning": It was able to correctly reason about how many should be subtracted from the final result.

* "Math": It successfully did some basic additions and subtractions, arriving at the correct answer.
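For reference, here is a minimal Python sketch of the arithmetic behind 11, under the reading I intended. It rests on two assumptions that the puzzle doesn't state outright: the frog is green, and the two missed unicorns come out of the stated totals. The names `unicorns`, `missed`, and `count_seen` are made up for illustration.

```python
# Counts from the puzzle, read as "there were" rather than "he saw".
unicorns = {"purple": 5, "green": 2, "pink": 7}

# Assumption: the two missed unicorns are subtracted from the stated totals.
missed = {"purple": 1, "green": 1}


def count_seen(unicorns, missed, frog_color):
    """Count unicorns, skipping the frog's own color and any missed ones."""
    total = 0
    for color, n in unicorns.items():
        if color == frog_color:
            # The frog can't see unicorns of its own color at all.
            continue
        total += n - missed.get(color, 0)
    return total


# Assumption: the frog is green.
print(count_seen(unicorns, missed, "green"))  # 11
```

Under these assumptions the green unicorns drop out entirely, leaving 4 purple and 7 pink.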

Crucially, I made this up right here on the spot, and used a die for some of the numbers. This question does not exist anywhere in the training corpus!

I think this demonstrates an impressive level of intelligence, far beyond what, up until about a year ago, I thought a computer would ever be capable of in my lifetime. Now, in absolute terms, current-gen ChatGPT is of course clearly far worse at reasoning and understanding than most people (well, specifically, it seems to me that its knowledge and reasoning are super-humanly broad, but child-level deep).

Can future improvements to this architecture improve the depth up to "AGI", whatever that means? I have no idea. It doesn't automatically seem impossible, but maybe what we see now is already near the limit? I guess only time will tell.

3 comments

This puzzle is too poorly-worded to be solvable, due to the ambiguous nature of "see" and "count". Could you describe what the actual situation was, what the frog perceived it to be, and what color the frog was?
Ok, here's a (hopefully) better worded puzzle, again made up by myself right now.

There are 12 frogs. Five are green, 3 red, and 4 yellow. Two donkeys are counting the frogs. One of the donkeys is yellow, the other green. Each donkey is unable to see frogs that are the same color as itself, also each donkey was careless and missed a frog when counting. How many frogs does the green donkey count?

GPT4 answers 6 every time for me.
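The expected answer can be checked with a short Python sketch. It assumes the frog each donkey carelessly missed is one it could otherwise see; `frogs` and `donkey_count` are names invented for this example.

```python
# Frog counts from the puzzle (12 frogs in total).
frogs = {"green": 5, "red": 3, "yellow": 4}


def donkey_count(frogs, donkey_color, missed=1):
    """Frogs a donkey counts: those not its own color, minus the ones it missed."""
    visible = sum(n for color, n in frogs.items() if color != donkey_color)
    # Assumption: the missed frog is one of the visible ones.
    return visible - missed


print(donkey_count(frogs, "green"))  # 6
```

The green donkey sees 3 red and 4 yellow frogs, then misses one, which gives 6.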

My point is that GPT is capable of a certain amount of "reasoning" about puzzles that most certainly don't exist in its training data. Playing with it, it's clear that in this current generation the reasoning ability doesn't go very deep - just change the above puzzle a little to make it even slightly more complicated and it breaks. The amazing thing isn't how good at reasoning it is, but that a computer can reason at all.

So what color was the frog supposed to be in the original question?
Green of course? Anything else would be highly unusual and a normal reader would expect it to be called out.
The correct answer is 14 ... the frog counted what it saw and it saw 5, 2, 7 unicorns.
It clearly says he didn't see some of them either at all or as unicorns. The correct answer is 11.

Edit: I do see now that "He saw" kind of messes the question up. My intent would have been better expressed with "There were". But again this proves my point! GPT4 is able to (most of the time) correctly work through the poor wording and interpret the question the way I meant it, and I think the way most people would read it.

the correct answer is 14. there is no logic/linguistic/semantic reason why "he didn't see a purple unicorn" should refer to the purple unicorn that he (according to your statement) did see. "he saw a red ball, but he didn't see one ball: a red one. how many balls did he see?". also regarding the green one ... there is no _logical_ reason why a "magical" frog should be green ... one can debate long about your question but a semantically sound interpretation implies: the frog saw 14 unicorns and the frog is not green. anything else falls apart because if the frog is green then how could he have seen a green uni? which is what you wrote for context.
Do you disagree with my claim that GPT-4 can perform some sort of basic reasoning about puzzles that aren't in its training data?
It answered 16 for me. Then 10 when I tried again. Then 12. And 15.
ChatGPT 3.5? I'm using 4 and get 11 most times but other numbers occasionally.
Tried with GPT4, I got 12.
and neither is correct. the right answer is 14.