Hacker News
by cj 1122 days ago
I asked ChatGPT to tell me a riddle.

It was “What is always hungry, needs to be fed, and makes your hands red?” (Or something like that)

I asked for a hint about 5 times and it kept giving more legitimate sounding hints.

Finally I gave up and asked for the answer to the riddle, and it spit out a random fruit which made no sense as the answer to the riddle.

I then repeated the riddle and asked ChatGPT what the answer was, and it gave me the answer (“Fire”) which makes sense as the answer to the riddle.

But it was giving extremely bad hints, like “it starts with the letter P” and “it’s a fruit”.

That was a great way to show my non-tech family members the limitations of AI and why they shouldn’t trust it.

Playing “20 questions” with ChatGPT is another great way to expose its limitations. It knows the game and tries to play, but is terrible at asking questions to narrow down possible answers.

There really needs to be some confidence or accuracy score/estimation displayed alongside its output.

Or, learn how to say “I don’t know”

23 comments

I've had a few moments with ChatGPT that are great anecdotes similar to your own:

- Asked it to generate a MadLib for me to play that was no more than a paragraph long. It produced something that was several paragraphs long. I told it "no. That's X paragraphs. I asked for one that is only 1 paragraph long" and it would respond "I'm sorry for the misunderstanding. Let me try again" and then would make the same mistake. It never got it right.

- Asked it, "Can you DM a game of Dungeons and Dragons?" and it said something like, "Yes! I'd love to DM a game of Dungeons and Dragons for you". Dumped some text to the screen about how we'd have to adapt it some. I asked it to begin, and it asked a few questions about the character I would want to play. I answered the few questions it asked. Then it finally dumped a page of text to the screen as "background" to my character and the quest I was going to embark on. Then it said something like, "You win. Good job! Hope you enjoyed your quest!"

I showed these to my family and they were all a little deflated about AI. Like they realized how willing it was to pretend to do what you wanted and just make up its own answers.

There's definitely a potential for a D&D DM with an LLM, but you'd need a lot of careful prompting and processing to handle the token limits today's models have. Simply put: a D&D game has more story and state than the 30,000-ish words an LLM can think about at once.

I think there's a lot of interesting opportunities there.

I've also heard (here) that after you get 20-ish questions into an instance you start getting the really weird output. Some of the conjecture was because that's about how deep they trained.

In any case, if that's true, that's a very short role playing session, unless there's a good way to retain info but reset the state that accrues and causes problems (if indeed that happens).

Yeah, but you could ask the LLM to do a few different things at each step.

You could provide it with the background, the story, the secrets, and the summary of everything that has happened so far- as well as what new things have taken place. Then ask it to re-write the summary of the story so far.
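A minimal sketch of that rolling-summary step, assuming a generic `llm(prompt) -> str` callable. The function names, the one-token-per-word estimate, and the 3000-token budget are all made up for illustration, and `fake_llm` stands in for a real model call so the snippet runs offline:

```python
from typing import Callable, List

def rough_token_count(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())

def build_summary_prompt(background: str, summary: str, new_events: List[str]) -> str:
    # Assemble background, the running summary, and the newest events into one prompt.
    events = "\n".join(f"- {e}" for e in new_events)
    return (
        "Background:\n" + background + "\n\n"
        "Summary so far:\n" + summary + "\n\n"
        "New events:\n" + events + "\n\n"
        "Rewrite the summary so it also covers the new events."
    )

def update_summary(llm: Callable[[str], str], background: str, summary: str,
                   new_events: List[str], budget: int = 3000) -> str:
    # Refuse to send a prompt that would not fit the assumed context budget.
    prompt = build_summary_prompt(background, summary, new_events)
    if rough_token_count(prompt) > budget:
        raise ValueError("prompt exceeds the assumed token budget")
    return llm(prompt)

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "The party entered the crypt and found a silver key."

summary = update_summary(
    fake_llm,
    "A haunted valley.",
    "The party met at the inn.",
    ["They entered the crypt.", "They found a silver key."],
)
print(summary)
```

The point of the design is that only the rewritten summary is carried forward each turn, so the prompt stays roughly constant in size no matter how long the game runs.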

Separately, you could give it all that context and what the players have asked of it, and ask what response to give.

As well, you could be recording all the events that have happened in a vector store, and do a search on it when players ask questions, and use those as context to the LLM when asking it what to reply.
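Roughly what the event-store lookup could look like. This is a toy: the "embedding" here is just a word-count vector compared with cosine similarity, where a real setup would presumably use a learned embedding model and a proper vector database. The `EventStore` class and the sample events are invented for the example:

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (an assumption, not a real embedding model).
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class EventStore:
    def __init__(self) -> None:
        self.events: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Record an event together with its vector.
        self.events.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> List[str]:
        # Return the k recorded events most similar to the query.
        q = embed(query)
        ranked = sorted(self.events, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = EventStore()
store.add("The innkeeper gave the party a rusty key.")
store.add("A dragon burned the northern bridge.")
store.add("The wizard hid the map under the altar.")

context = store.search("Where is the map hidden?", k=1)
print(context)
```

Whatever `search` returns would then be pasted into the prompt as context before asking the model what to reply.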

There are lots of neat tricks we can use to help an LLM overcome its limitations.

Aetolia tried this in a 24 hr experiment: https://nwn.blogs.com/nwn/2023/04/aetolia-mud-virtual-world-...

Sounds like it was a success! I suppose it comes down to cost - I think it'd be fun to try a single player game authored like this and would be willing to use my own API token to try it out.

'AI Dungeon' has been a real product since GPT-2. You can buy it on Steam.

https://en.wikipedia.org/wiki/AI_Dungeon

That’s the whole point of these "agents", and things like LangChain or LlamaIndex.

Haven’t gotten around to that part yet, it seems it could help.

I find those examples quite encouraging, actually, in that they show that the current SOTA is still pretty far off from creating a planet-killing and/or species-ending AI.
Did you really expect ChatGPT to be a full blown dungeon master?
With all the hype about it, the average person does.
Eventually, sure. But the point is that ChatGPT claims to be. But then it does nothing more than tell a story and then say “You win!”
I need to know what version of ChatGPT you were using, because this is a critical piece of information that everyone just blatantly ignores, and I can only imagine that it's out of ignorance of the significance of the difference.

This is what happened when I asked ChatGPT 4...

ME

Give me hints without outright telling me the answer to the riddle: "What is always hungry, needs to be fed, and makes your hands red?"

ChatGPT 4

Hint 1: This object can be found in many homes, usually in a specific room.

Hint 2: It is not a living creature, but it does "consume" things.

Hint 3: The part about hands turning red relates to a common physical response to a certain temperature.

Hint 4: It's a crucial tool for a common household task.

Hint 5: While it can't eat in the traditional sense, you do put things into it quite frequently.

ME

What's the answer?

ChatGPT 4

The answer to the riddle is a "fire" or "fireplace." It's always hungry for more fuel (like wood), needs to be fed to keep burning, and the heat from the fire can make your hands red.

If I were still able to edit my original comment, I would add a note at the bottom that says to take the experience as a casual person downloading an AI app after hearing about it on the news.

Such as a lawyer who’s not particularly tech savvy.

The main point is it’s irresponsible to trust LLM output for any critical/important purpose because it’s not perfect. But too many first time users think it is perfect and trustworthy at face value, when it’s not.

I don’t actually know the version since I was interacting via an unofficial iOS app using some LLM under the hood. It may not have even been ChatGPT.

Then why did you specifically claim that you asked ChatGPT? These distinctions, while small, are important. Not really the type of mistake to make for a tech-savvy forum.
The iOS app that I had this interaction on was ChatOn.

https://apps.apple.com/us/app/chaton-ai-chat-bot-assistant/i...

The subtitle on the Apple App Store is "Powered by ChatGPT & GPT-4".

If you're to believe the app's advertising, it is powered by ChatGPT GPT-4.

I guess the point is GPT-4 hallucinates, too. Maybe it did well for this example but still a lawyer should not trust its output.
Maybe, but it's surprisingly good in the face of all the non-version-indicating complaints about how terrible people think it is. Mostly I doubt that the lawyer was using GPT4, because the lawyer sounds like the kind of person who would be ignorant of the significance of the difference.
The kind of person too lazy to check the output of a computer program before submitting it to a court of law is the type of person too cheap to pay $20 for the good version of the program.

Think: Lionel Hutz.

No, checking was done!

"Oops, I'd better remove that comma".

Here is what poe told me:

“Think about things that you might need to feed regularly, especially things that might leave stains on your hands. It might be something that you encounter frequently in your daily life.”

“ The answer to the riddle is "fire". Fire is always hungry for fuel, needs to be fed with wood or other combustible materials, and can make your hands red if you handle it or if you're cooking food over an open flame.”

And second time:

“This thing is a natural phenomenon that can occur in different situations and locations. It is often associated with warmth and light, and it can be both helpful and dangerous. People have been using it for various purposes for thousands of years.”

“The answer to the riddle is "fire". Fire is a natural phenomenon that needs fuel to keep burning, and it produces heat and light that can make your hands red. Fire has been used by humans for various purposes for thousands of years, such as cooking, heating, and providing light. However, fire can also be dangerous if not handled carefully.”

I am amazed at how a free bot is basically as good as, if not better than, OpenAI’s bot. I am quite certain locally run LLMs will also be able to outcompete ChatGPT. Probably why Sam Altman is desperate to block them.

When the answer is something ridiculous or stupid, it's 95%+ of the time GPT-3.5-turbo and is rarely disclosed by the other party. GPT-4 is an order of magnitude better, if not two orders of magnitude better.

It's hard to tell if the party crapping on ChatGPT is doing so out of ignorance or malice.

Finetuning with GPT-4 can't come soon enough...

Wow. That's kind of terrifying.
That's GPT-4, not ChatGPT (3.5-turbo I think).

Also, yes you can get correct information by tailoring your prompts, but that isn't the issue. The issue is that some prompts lead to bad results and confusing/incorrect answers. You changed what OP queried by providing the riddle and asking for hints to that riddle, whereas OP asked for a random riddle and then hints to that riddle.

> That's GPT-4, not ChatGPT

It absolutely is ChatGPT, the paid monthly "Plus" version, using the GPT4 model instead of the 3.5 model.

> Or, learn how to say “I don’t know”

It doesn't know that it doesn't know! It is, very roughly speaking, a model that is designed to print out the most likely word given its current input and training, and then the next word etc.

Whereas you or I might be mistaken about some of our faculties, memories and skills, ChatGPT cannot possibly "know" what its limitations are. It was never taught what it was not taught (obvs) nor can it conceive of what it was not taught about. It might come back with some useful chat in this regard but it cannot reason.

It seems that you don't know what you don't know, really.

There's no way to definitively know what properties ChatGPT has. It does seem to reason to some extent and it does often say that some information isn't known/there's no data. And it almost obnoxiously often tells you that it's simplifying a complex and multifaceted situation.

It's a model that takes an input and spits out the most likely output given its training.

"There's no way to definitively know what properties ChatGPT has." - yes there is: ask it how the war in Ukraine is progressing or some other time based thing. It stops in 2021.

It is a really useful tool but it isn't sentient.

I don't agree. It's like saying humans are just animated meat. You're massively simplifying the situation.
I never mentioned humans or meat yet you decided to pop those words in my mouth (mmm tasty).

I merely tried to very quickly give a hint as to what ChatGPT is - it's an LLM and that is something that takes a (normally textual) input and outputs an expected best guess word for an answer and then does it repeatedly until the full answer is returned.

It doesn't know things. I promise.
> Or, learn how to say “I don’t know”

This is the correct answer. It is like a sad salesman who is out of his depth, but decides to keep bullshitting!

I'd rather a confidence score for each response. The last thing I need is another reason for the AI to ignore the question or feel the need to explain why it was ignoring it.
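One way such a score could be computed, assuming the model API exposes per-token log-probabilities (some do, some don't). The function name and the sample numbers are made up. Big caveat: this measures how sure the model was about its wording, not whether the answer is true, so a fluent hallucination can still score high:

```python
import math
from typing import List

def sequence_confidence(token_logprobs: List[float]) -> float:
    # Geometric mean of the token probabilities, i.e. exp(mean log-probability).
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for two short answers.
confident = [math.log(0.9), math.log(0.95), math.log(0.9)]
shaky = [math.log(0.9), math.log(0.2), math.log(0.3)]

print(round(sequence_confidence(confident), 3))
print(round(sequence_confidence(shaky), 3))
```

Averaging in log space keeps long answers from being penalized just for being long, which a plain product of probabilities would do.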
Well, that tells you a lot about:

1. The people designing it (either optimists or looking for a quick exit).

2. The learning set they're using, which I believe is some kind of internet crawl of sorts? I imagine humanity, as a whole, bullshits its way through most of its life.

It makes sense that there wouldn't be a lot of training data that consists of people admitting they don't know - forums like stack overflow wouldn't even permit such a post/comment anyway. So presumably ChatGPT is very heavily weighted against giving "I don't know" as an answer when virtually all the examples of people asking questions in its training data are followed by plausible sounding answers.
It is almost certainly case 2. People don't often write text on the internet just to say that they don't know. It doesn't even make sense to post that most of the time.

A great example would be on a Q/A forum or something like Stack Overflow. It's better to let someone else answer when you don't know.

or a politician. even when presented with incontrovertible evidence, double down
> That was a great way to show my non-tech family members the limitations of AI and why they shouldn’t trust it.

These are the limitations of the version of ChatGPT you were using at that moment. They are not categorical limitations of AI or even LLMs.

It’s amazing to me how many people are sleeping on AI, mixing up the failing cases of a freemium chatbot for the full capability of the tech, even on HN. LLMs can say “I don’t know”. Even ChatGPT can do it. Ask some super niche historical questions of any version and see what you get. Is it perfect every time? No. But that’s something that can be reduced.

Over the next year, you’ll see more instances of lawyers citing hallucinated cases. There will also be a handful of startups that hook up LLMs to document stores, and they’ll be able to check for this sort of thing and do an even better job.
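A bare-bones illustration of the "check it against a document store" idea: flag any citation in a draft that can't be found in a trusted set. The case names are deliberately fake placeholders, and the in-memory set stands in for whatever real legal database such a startup would query:

```python
from typing import Iterable, List, Set

def normalize(citation: str) -> str:
    # Lowercase, drop periods, and collapse whitespace so "v." and "v" match.
    return " ".join(citation.lower().replace(".", "").split())

def unverified_citations(cited: Iterable[str], known: Set[str]) -> List[str]:
    # Return the citations that do not appear in the trusted store.
    known_norm = {normalize(k) for k in known}
    return [c for c in cited if normalize(c) not in known_norm]

# Placeholder data: none of these are real cases.
known_cases = {"Alpha v. Beta", "Gamma v. Delta"}
draft_citations = ["Alpha v Beta", "Epsilon v. Zeta"]

flagged = unverified_citations(draft_citations, known_cases)
print(flagged)
```

Anything in `flagged` would need a human to look it up before the draft goes anywhere near a court.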

> LLMs can say “I don’t know”. Even ChatGPT can do it.

That's the problem in my opinion. When you know something is capable of saying "I don't know" but confidently spits out some hallucinated BS is when the average person eats it up.

It is definitely a problem. OpenAI does a lot to warn people, but I’m not really sure it’s enough.
I don't know exactly why, but for some reason this made me think of qAnon, and now I'm thinking of an AI trained on qAnon theories that people can form a community around like they did qAnon, and frankly that's one of the most terrifying things I've thought in quite a while.
I remember someone built a 4chan robot and posted it to HN. The robot immediately displayed the terrible parts of that site.
I made https://AskHN.ai

What it does is not try to answer, but collect previous topics discussed by experts. Then answer the question based on the text, a far more reliable approach.

How does it qualify experts? I love the discussion here but if it turns to international nuclear strategy or the minutiae of electrical networks (or presumably anything outside the regular wheelhouse) I notice that the quality goes down but the confidence stays the same.
Under the hood it builds and ranks the expertise of everyone in the network. That said, it doesn’t have knowledge outside the network, so if the network itself has low quality experts or no data, it’s going to give subpar results
Maybe the real definition of intelligence isn't being able to answer questions effectively, but being able to know what you know and know what you don't know.
There's truth to this, but in a time and place. There are times being very specific about what you know and don't know is important, but most of the time we are learning little by little, and we benefit from saying things we aren't entirely comfortable with, if only to figure out whether they are true and/or socially acceptable.
> Or, learn how to say “I don’t know”

It can't. Correct me if I'm wrong - I probably am - but as a generative language model I was under the impression it has no actual understanding of the meaning of its words, only the probability they work well together.

You wouldn't expect dice to understand why you didn't want to roll snake eyes.

It could if it had enough training data that demonstrated "I don't know" is a perfectly valid answer to a question. The risk then is that it answers "I don't know" even when it is actually capable of providing a useful answer. In fact I've had it do that before too - not exactly "I don't know" but vague statements that aren't particularly helpful, when it turns out with the right prompt it "knew" exactly what the answer was.
ChatGPT 3.5 or GPT 4?

Almost every negative comment about LLMs is by someone using an older, weaker model and making generalisations.

Here’s GPT 4 giving me a riddle: https://chat.openai.com/share/1753ce5a-d44d-44ac-bc97-599a26...

> But was it GPT4

I keep seeing this cop-out, which ignores that it's fundamentally the same architecture, and has the same flaws. More wallpaper to hide the cracks better makes it an even worse tool for these use cases because all it does is fool more people into thinking it has capabilities that it fundamentally doesn't.

I don't think this is a fair argument. If we compare a GPT4 architecture with 5,000 parameters and a GPT4 architecture with 1 trillion parameters, should we judge the capabilities of both by the 5,000 parameter version, because they're both the same architecture?

There is more than architecture that can set them apart as well. GPT4 may have been trained by a slightly different algorithm, or on different data, and this can result in fundamentally different results.

Most of these conversations are not focused on one specific version, but are about the capabilities of LLMs in general, and it is implied we are talking about state-of-the-art LLMs, and GPT3 is no longer state-of-the-art.

This is nonsense. It's not a cop-out to say "use the latest, most capable model before complaining". Anyone remotely close to this field knows model size matters, amount of training data matters, quality of training data matters, and several other variables matter. Even if someone knows zero about it, just using 3.5 v 4 is enough to see they are two different things. Like a lizard v a human.
It's still fundamentally the same, hallucinates just the same, and anthropomorphizes itself as a confident, knowledgeable, intelligent being just the same. A newer, better, faster, more capable car still isn't an airplane, even if it goes fast enough to spend several seconds in the air.
Sure, and 40 year olds have the same capabilities as 4 year olds, because "same architecture" or "fundamentally the same". And putting random weights inside the GPT-4 model architecture should behave "fundamentally the same" as the trained GPT-4 weights, because it's "same architecture". Forget this "training" stuff.
It's not a person, it's a machine. And it's one that will still produce hallucinations that embarrassingly prove that it has no notion of intelligence, and do so confidently. That it does so less than its sibling is entirely irrelevant.
To me it's a bit like someone making the claim "humans are flawed, and we should think critically about the things they say", and someone responding with "well which human are you talking about? Because Einstein is orders of magnitude above the Walmart checkout guy".
I have asked it to write code. The code looked reasonable but did not work (using a non-existent API or library). I said the library XXX doesn't exist. And it would respond with an apology and a new solution that maybe worked maybe didn't. Sometimes it could pull off a save eventually.
i keep getting it to write code that works but then its example usage makes no sense. i point out the code is right but the example it gives is wrong and it can almost never fix it. version 4.
> Playing “20 questions” with ChatGPT is another great way to expose its limitations. It knows the game and tries to play, but is terrible at asking questions to narrow down possible answers.

GPT4 did well enough for me. The questions below are asked by GPT4, and the yes/no answers are my responses to GPT4. I think it did quite well at 20 questions.

Is it a person? No. Is it a place? No. Is it a tangible object? Yes. Is the object man-made? Yes. Is the object typically found indoors? Yes. Is it an electronic device? No. Can it be held in one hand? Yes. Is it used primarily in the kitchen? No. Is it related to a specific hobby or activity? Yes. Is it used in a sport or physical activity? No. Is it used in creating or making something? No. Is it associated with a board game or a card game? No. Is the object related to music? Yes. Is it a musical instrument? No. Is it used to control or manipulate sound? Yes. Is it typically used in the process of recording music? Yes. Is it a type of microphone? No. Is it used to alter or modify the sound of a musical instrument during the recording process? No. Is it related to the control or adjustment of volume or sound levels? No. Is the object used to store or convey recorded music? Yes. Is it a vinyl record? No. Is it a compact disc (CD)? Yes. (22 questions total.)
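For what "asking questions to narrow down possible answers" means in the ideal case: each yes/no question should split the remaining candidates as close to in half as possible, which is why 20 good questions can in principle separate about a million items. A small sketch with an invented candidate list and invented questions (this is not how GPT-4 picks its questions, just the textbook strategy):

```python
from typing import Dict, Set

def best_question(candidates: Set[str], questions: Dict[str, Set[str]]) -> str:
    # Pick the question whose "yes" set splits the remaining candidates most evenly.
    def imbalance(q: str) -> int:
        yes = len(candidates & questions[q])
        return abs(2 * yes - len(candidates))
    # sorted() makes tie-breaking deterministic (alphabetical).
    return min(sorted(questions), key=imbalance)

# Made-up candidates and the items each question would get a "yes" for.
candidates = {"cd", "vinyl", "microphone", "guitar", "toaster", "novel", "chess set", "kettle"}
questions = {
    "Is it related to music?": {"cd", "vinyl", "microphone", "guitar"},
    "Is it a CD?": {"cd"},
    "Is it used in the kitchen?": {"toaster", "kettle"},
}

print(best_question(candidates, questions))
print(2 ** 20)  # number of distinct yes/no answer sequences over 20 questions
```

Guessing a specific item early ("Is it a CD?") only removes one candidate on a "no", which is the pattern that makes a weak player run out of questions.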

GPT does nothing but predict the next word (token, to be more exact). If you ask it for hints, it's going to come up with some hints that sound like they could be plausible in context of the original riddle. If you then ask it what the answer was, it's going to try to take all the earlier hints into account, because that's now the text it's trying to complete.

There's no entity sitting on the other side of the screen thinking about the riddle's solution from the start. There's just a model that completes the given text one token at a time. You should expect similar results as if it was a bunch of people passing a single piece of paper to each other, each writing a single word on it, not communicating with each other in any other way and trying their best to come up with a coherent result. It's not going to make sense all the time.

> You should expect similar results as if it was a bunch of people passing a single piece of paper to each other, each writing a single word on it, not communicating with each other in any other way and trying their best to come up with a coherent result.

I'm pretty sure GPT-4's context window is bigger than '1 token'.

Yes, obviously...? A piece of paper holds more than a single word as well. That piece of paper is the context in this analogy.

It still generates one token at a time and can't backtrack afterwards.
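To make the one-token-at-a-time, no-backtracking loop concrete, here is a toy generator. It is a drastic simplification: a bigram table that only looks at the previous word, whereas a real LLM conditions on the whole context window. The corpus and function names are invented for the example:

```python
import random
from collections import defaultdict
from typing import Dict, List

def train_bigrams(corpus: str) -> Dict[str, List[str]]:
    # Map each word to the list of words observed immediately after it.
    words = corpus.split()
    table: Dict[str, List[str]] = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def generate(table: Dict[str, List[str]], start: str, max_tokens: int, seed: int = 0) -> List[str]:
    # Append one sampled token per step; earlier tokens are never revised.
    rng = random.Random(seed)
    out = [start]
    for _ in range(max_tokens):
        choices = table.get(out[-1])
        if not choices:
            break  # no known continuation for the last word
        out.append(rng.choice(choices))
    return out

corpus = "fire is always hungry and fire needs to be fed and fire makes your hands red"
table = train_bigrams(corpus)
tokens = generate(table, "fire", 8)
print(" ".join(tokens))
```

Nothing in the loop holds a plan or an intended answer: the only state is the text produced so far, which is the piece of paper in the analogy.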

> Or, learn how to say “I don’t know”

That would require semantic understanding the LLM doesn't possess.

>Playing “20 questions” with ChatGPT is another great way to expose its limitations. It knows the game and tries to play, but is terrible at asking questions to narrow down possible answers.

The Rise of the Machines will be staved off as long as ChatGPT doesn't absorb the powers of the mighty Akinator

A colleague tried 20 questions with ChatGPT and the answer they'd chosen was "Margaret Thatcher" (UK Prime Minister, the "Iron Lady")

ChatGPT got as far as basically narrowing it down to post-War UK Prime Ministers, which is fairly impressive although it only had a few questions left. Then though it decided the answer must be "Winston Churchill". Churchill isn't meaningfully a post-War PM. He lost the July 1945 General Election, which was before the Pacific victory.

It did guess Maggie, with nothing left, at a point where I don't think it had ruled out Blair, Cameron or Heath, let alone say, Liz Truss, but guessing Churchill first shows the limitations of such a model.

Churchill was also prime minister from 1951-55.
Huh, it's probably that simple, I mean it doesn't explain why it thought it had definitely found the answer given there were a bunch more PMs to go, but yeah that does qualify Churchill.
> I asked ChatGPT to tell me a riddle.

Which version of ChatGPT, if you don't mind me asking?

It was an iOS app (ChatOn) not the official app. So probably GPT-3 (edit: the app claims to use GPT-4, so not sure)

After the riddle, I bought the $20/mo subscription via the official OpenAI app to try it on GPT-4. I started by trying to play “20 questions” but we couldn’t get past 10 questions before getting an error message “rate limit exceeded, try again in an hour”

Why would they mind you asking?
I gave playing 20Q with the GPT-4 version a try, and it did surprisingly well. It didn't quite get it in 20 questions, and correctly gave up once it hit that, but it was almost at my object (a computer power supply). Given what it was asking, I expect it would get it in one or two more questions. It's just one sample, but it asked the right questions to narrow the field to computer-related things, and then narrowed down to power-related things, with its final guess being a UPS.
Is that 3.5 or 4? I don't play word games with GPT but for coding tasks it went from unusable most of the time to absolutely indispensable team member between 3.5 and 4.
> Playing “20 questions” with ChatGPT is another great way to expose its limitations. It knows the game and tries to play, but is terrible at asking questions to narrow down possible answers.

I began doing this last winter, and while it tends to be a bit slow I'm quite impressed that it can manage at all.

Chat GPT versus Akinator!
ChatGPT doesn’t know what is right or wrong.

It simply knows what the highest probability next word should be.

I think part of this is because GPT doesn’t have any “hidden variable” storage and doesn’t get any prep time up front to come up with something coherent. Just completes the next token based on the previous context.
You can give it prep time, tell it to reason out loud and it will write a paragraph (or two) about what it is thinking--or rather, the paragraph is its "thinking".
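One workaround for the missing "hidden variable" storage is to have the application hold the secret and feed it back in on every turn. A sketch under that assumption: `RiddleSession`, the prompt wording, and the lambda standing in for a model call are all hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RiddleSession:
    riddle: str
    answer: str  # held by the application, never shown to the player
    hints: List[str] = field(default_factory=list)

    def hint_prompt(self) -> str:
        # Every prompt restates the secret answer and the hints given so far.
        given = "\n".join(f"- {h}" for h in self.hints) or "(none yet)"
        return (
            f"Riddle: {self.riddle}\n"
            f"Secret answer (do not reveal): {self.answer}\n"
            f"Hints already given:\n{given}\n"
            "Give one new hint that is consistent with the secret answer."
        )

    def next_hint(self, llm: Callable[[str], str]) -> str:
        # Ask the model for a hint and remember it for later prompts.
        hint = llm(self.hint_prompt())
        self.hints.append(hint)
        return hint

session = RiddleSession(
    riddle="What is always hungry, needs to be fed, and makes your hands red?",
    answer="fire",
)
# Hypothetical stand-in for a real model call.
print(session.next_hint(lambda prompt: "It gives off heat and light."))
```

With the answer fixed in the prompt from the first turn, the model no longer has to invent one after the fact that fits whatever hints it already produced.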
the interesting part is I've interacted with some people going through manic phases and it's REALLY similar to chatgpt, the ability to string words and thoughts together without any large scale plans.
I played 20q three times with GPT4. Aluminum can, plastic straw, and tire pressure gauge were my items. It had no problems getting to the answers with sensible questions throughout.
Maybe you didn't get the riddle.