by LifeIsBio 1145 days ago
The game “20 questions” is probably the hardest I’ve seen chatGPT fail.

What’s interesting about the game is that, at first pass, there’s no ambiguity. All questions need to be answered with “Yes” or “No”. But many questions asked during the game actually have answers of “it depends”.

For example, I was thinking of “peanut butter” and chatGPT asked me “Does it fit in your hand?” as well as “Is it used in the kitchen?”. Given my answers, chatGPT spent the back half of its questions on different kitchen utensils. It never once considered backing up and verifying that there wasn’t some misunderstanding.

I played three games with it, and it made the same mistake each time.

Of course, playing the game via text loses a lot of information relative to playing IRL with your friends. In person, the answerer would pause, hum, and otherwise demonstrate that the question asked was ambiguous given the restrictions of the game.

Regardless, it was clear that chatGPT wasn’t accounting for ambiguity.

8 comments

> It never once considered backing up and verifying that there wasn’t some misunderstanding.

Of course not; ChatGPT doesn't "consider". It doesn't think, it doesn't know. It can't identify that there was a misunderstanding of its own volition.

All ChatGPT does is use a (very sophisticated!) statistical analysis to generate text that conforms to an expectation of what a human response to a similar prompt might look like. It has been trained well insofar as it is able to produce responses that seem like a human may have written them, but it doesn't reveal cognitive processes like "reconsidering" because it doesn't have any.

Wow never heard this comment before
Comments of that nature will continue so long as there are people who don't understand how language models work (or choose to misrepresent them).
20-some years ago, I had this "20 questions" handheld electronic game that was eerily good at winning. I imagine it was a bunch of well-programmed tables of data, but in any case, it's certainly possible for a machine to do well at this game.

I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", it will become even more impressive.
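
A minimal sketch of what a table-driven player could look like, assuming a tiny hand-written table of objects and yes/no attributes. The table contents, the `rank` name, and the scoring rule are all made up for illustration; this is not a claim about how the handheld game actually worked. Scoring candidates instead of eliminating them means a single "it depends" answer doesn't permanently rule out the right object.

```python
# Hypothetical knowledge table: object -> {question: expected answer}.
KNOWLEDGE = {
    "peanut butter": {"fits in hand": True, "used in kitchen": True, "edible": True},
    "spatula": {"fits in hand": True, "used in kitchen": True, "edible": False},
    "refrigerator": {"fits in hand": False, "used in kitchen": True, "edible": False},
    "apple": {"fits in hand": True, "used in kitchen": False, "edible": True},
}


def rank(answers):
    """Rank objects by agreement with the player's answers.

    `answers` maps question -> True, False, or None (None means "it depends").
    Each matching answer adds a point and each mismatch subtracts one, so no
    single answer can eliminate a candidate outright.
    """
    scores = {}
    for name, facts in KNOWLEDGE.items():
        score = 0
        for question, answer in answers.items():
            if answer is None or question not in facts:
                continue  # ambiguous or unknown: no evidence either way
            score += 1 if facts[question] == answer else -1
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank({"fits in hand": None, "used in kitchen": True, "edible": True})` still puts "peanut butter" first, even though one answer was left ambiguous.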

> I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", it will become even more impressive.

Agreed. Incidentally I’ve built a little toy version of a runtime for exactly this purpose - there’s a translation layer that’s given a bunch of available “APIs” (fed through the LLM context), and breaks down a high level goal into a structured series of API calls.

the runtime parses these API calls, and natively executes some (e.g. run a program, write to the file system) and others result in LLM invocations.

I’m sure OpenAI and crew are way ahead of me here, of course. I’m excited to see what the future holds in this field.
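
A rough sketch of the kind of runtime described above, assuming the model emits its plan as a JSON list of calls. The API names (`write_file`, `ask_llm`), the JSON shape, and the stubbed `fake_llm` are all hypothetical; a real version would call an actual model where the stub is.

```python
import json


def write_file(path, text):
    """Native call: write text to a file and return the number of characters."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return len(text)


def fake_llm(prompt):
    """Stand-in for a model invocation (assumption: a real runtime calls an LLM here)."""
    return f"[model output for: {prompt}]"


# Registry of available "APIs"; the same list would be described in the LLM context.
APIS = {"write_file": write_file, "ask_llm": fake_llm}


def run_plan(plan_json):
    """Parse a JSON plan and execute each call in order, rejecting unknown APIs."""
    results = []
    for step in json.loads(plan_json):
        name = step["api"]
        if name not in APIS:
            raise ValueError(f"unknown API: {name}")
        results.append(APIS[name](**step["args"]))
    return results
```

Rejecting anything outside the registry is the important design choice here: the model only proposes calls, and the runtime decides what actually runs.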

The first AI-style program I ever wrote (about 25 years ago. Yes, I'm old) played 20 questions, but it would "learn" from prior games, so the more you played, the better it performed.

It got extremely good after a few hundred games.
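
For anyone curious, a learning player like that can be sketched as a binary tree that grows every time it loses. This is the classic "guess the animal" approach and only an assumption about how such a program might work, not a reconstruction of the original. The `Node`, `play`, `answer`, and `teach` names are made up here.

```python
class Node:
    """A yes/no question with two children, or a leaf holding a guess."""

    def __init__(self, text, yes=None, no=None):
        self.text, self.yes, self.no = text, yes, no

    def is_leaf(self):
        return self.yes is None and self.no is None


def play(root, answer, teach):
    """Play one game and learn from a wrong guess.

    answer(question) -> bool answers a yes/no question.
    teach(wrong_guess) -> (correct_item, question_that_is_yes_for_it).
    Returns True if the program guessed correctly.
    """
    node = root
    while not node.is_leaf():
        node = node.yes if answer(node.text) else node.no
    if answer(f"Is it {node.text}?"):
        return True
    item, question = teach(node.text)
    # Turn the wrong leaf into a question that separates the two items.
    node.yes, node.no = Node(item), Node(node.text)
    node.text = question
    return False
```

Each lost game adds exactly one new question and one new item, which is why the tree gets noticeably better after a few hundred rounds.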

Yeah, ChatGPT could integrate Akinator[0] and trivially be great at the game. Without the help, though, it's a good, revealing benchmark for the LLM's ability.

[0] https://en.akinator.com

LLMs, for the foreseeable future, function most reliably as a user interface layer for other systems. I use GPT to “translate” natural language down into the API calls that get real data and it works great. I’d never trust it beyond that.
You trained it with "this phrase means this command" examples? How do you make it use your custom API? (Or are you not using a custom API?)
Basically yeah, just a pretty detailed set of prompts and then “turn the next message into an api call” and it basically works perfectly.

When I first heard the term “prompt engineer” I rolled my eyes, but now that I’ve gotten into it I see it’s really an art form.
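
A small sketch of that pattern, assuming a few-shot prompt and a strict check on whatever comes back. The example requests, the API names (`get_weather`, `count_signups`), and the JSON shape are invented for illustration; the call to the model itself is left out.

```python
import json

# Hypothetical few-shot examples: natural-language request -> API call.
EXAMPLES = [
    ("What's the weather in Paris?", {"api": "get_weather", "args": {"city": "Paris"}}),
    ("How many users signed up today?", {"api": "count_signups", "args": {"day": "today"}}),
]
ALLOWED = {call["api"] for _, call in EXAMPLES}


def build_prompt(message):
    """Assemble instructions, worked examples, and the new message into one prompt."""
    lines = ["Turn the next message into a JSON API call. Reply with JSON only."]
    for text, call in EXAMPLES:
        lines += [f"Message: {text}", f"Call: {json.dumps(call)}"]
    lines += [f"Message: {message}", "Call:"]
    return "\n".join(lines)


def parse_call(reply):
    """Accept the model's reply only if it is valid JSON naming an allowed API."""
    call = json.loads(reply)
    if (
        not isinstance(call, dict)
        or call.get("api") not in ALLOWED
        or not isinstance(call.get("args"), dict)
    ):
        raise ValueError(f"rejected call: {reply!r}")
    return call
```

The validation step is what makes "never trust it beyond that" workable: anything that isn't a well-formed call to a known API gets thrown away instead of executed.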

"Green Glass Door" also completely stumped it. It just could not deduce that the trick was semantic at the word representation level, rather than something related to the object that the word describes.

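For reference, the rule is about spelling, not about the object: in the usual version of the game, a word gets through the door if it contains a doubled letter (as "green", "glass", and "door" all do). A one-function sketch of that check, assuming that version of the rule; the function name is made up.

```python
def passes_green_glass_door(word):
    """Return True if any letter appears twice in a row, like 'ee' in 'green'."""
    letters = word.lower()
    return any(a == b and a.isalpha() for a, b in zip(letters, letters[1:]))
```

So a "puppy" can go through but a "dog" can't, and a "kitten" can but a "cat" can't, which is exactly the kind of pattern that is invisible if you only reason about what the words mean.
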
What's funny about 20 questions is that Akinator has been absolutely slaying it for like 20 years now.

What happens if you answer with something approximating the hemming and hawing rather than a straight yes or no? You can encode that into text, it's just less common outside of very informal chat conversations.
I just did a 20-questions with it, and was surprised by how badly GPT-4 did. Then for fun, I turned it around and had me be the guesser. It's weird and surreal to play 20-questions when you know that the clue-giver doesn't have an answer in their mind (or more literally, there isn't a single answer in any stateful form while you play), but is instead just eventually saying "yes that's what I was thinking of" when it's statistically appropriate.
With the code execution plugin, one could theoretically ask chatgpt to generate a salted hash of their answer at the start that's revealed at the end to prove it was correct.

Without any plugins, chatgpt would happily return SHA hashes and salts when I asked it to play rock paper scissors this way. The only trouble was, the hashes were totally wrong.
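
The commit-and-reveal idea itself is simple when real code computes the hash. A minimal sketch using Python's standard library, assuming SHA-256 and a random hex salt; the `commit` and `verify` names are made up here.

```python
import hashlib
import hmac
import secrets


def commit(answer):
    """Return (digest, salt): publish the digest now, keep salt and answer secret."""
    salt = secrets.token_hex(16)  # fixed-length random salt, 32 hex characters
    digest = hashlib.sha256((salt + answer).encode("utf-8")).hexdigest()
    return digest, salt


def verify(digest, salt, answer):
    """Check a revealed salt and answer against the digest published earlier."""
    expected = hashlib.sha256((salt + answer).encode("utf-8")).hexdigest()
    return hmac.compare_digest(expected, digest)
```

The answerer shares only the digest at the start of the game, then reveals the salt and the answer at the end so the guesser can run `verify`. The salt keeps the guesser from simply hashing a list of likely answers and comparing.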

i love your example, i wonder if this kind of game can be implemented in future training scenarios

we as humans understand ambiguity so much easier because we learn to speak and interact before we write, and writing ambiguity is way less obvious if you've never experienced it

I'm not sure I would think "food" when someone says they "use [it] in the kitchen". You "use" food? (Used in cooking != used in kitchen, imo)
I use food (including peanut butter) in cooking. I cook in the kitchen. Therefore peanut butter is a thing I use in the kitchen. Seems correct and proper to me.

The ambiguity as I see it is that the kitchen isn't the only place I use peanut butter. I've eaten it (which I think counts as "using") in other rooms. I've even made peanut-butter sandwiches (properly "using" it) in the living room before.

That's his whole point. It's possible to consider it technically correct, but it's a red herring.
Well, the alleged point is challenged. If, in playing this game, the questioner must constantly verify that the other party is using the language properly, they'll exhaust that 20-question limit rather quickly.

- is it used in the kitchen?

- yes.

- [well, kitchen appliances, here we go ..] is it ..?

...

- [aha. meat intelligence no speak proper English?] Is this thing you use in kitchen edible?

- Oh, yeah.

- [oh dear. we can not let meat machines govern this planet...]

I use peanut butter as an ingredient for sandwiches, usually in my kitchen.
Yes. You use edible things in preparing or cooking food (which may happen in the kitchen). 'Use' maps to food prep (the act) but never to prep location. Only in cases where the thing has both general edible and food preparation usage -- "I use honey extensively in the kitchen" for example -- does "use" and "edible" make sense.
But peanut butter has general edible and food preparation usage quite similar to honey, doesn't it? You can spread it on a slice of bread to eat directly or use it as a baking ingredient, but you probably wouldn't eat it by the spoonful straight from the container. (Or maybe that's how people usually eat peanut butter, I kind of don't want to know.)
guilty as charged: spoon + jar = happy mouth.
Yes, I do.
"He saw that gas can explode."

This ambiguous sentence stuck in my head some 30 years ago, when AI was popular.

There was a research paper discussing the issue of ambiguity.

Right -- although many things that are ambiguous in text are disambiguated in actual speech, so the problems that arise with audio speech are not wholly the same as with text.

A classic example is the word "record", which has first syllable stress as a noun, but second syllable stress as a verb. "I bought a RECord" vs "Please reCORD the music".

(in the dominant American dialect; I don't recall about other dialects/countries)

An interesting reprint in 2003

https://www.drdobbs.com/parallel/understanding-natural-langu...

"Computers still cannot understand natural language as well as young children can. Why is it so hard?"

Source: AI Expert, May 1987