| HN Mirror


by yaj54 840 days ago

GPT4 API and ChatGPT both get it wrong: "Sally has 2 sisters. Each of her brothers has 2 sisters because Sally and her 2 sisters make up the group of siblings each brother has."

GPT4 w/ CoT system prompting gets it right:

SYS: "You are a helpful assistant. Think through your work step by step before providing your answer."

USER: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"

ASSIST: "Let's break down the information given:

1. Sally has 3 brothers.
2. Each brother has 2 sisters.

Given that Sally is a girl, she is one of the sisters. Since each brother has 2 sisters and we know Sally is one of them, there must be one more sister to make the statement true for each brother.

Therefore, Sally has 1 sister."

The importance of prompting makes it quite difficult to compare model peak performance. Especially since different models have different styles of prompts that generate peak performance.
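For anyone wanting to reproduce the comparison, here is a minimal sketch of the two request payloads, with and without the step-by-step system prompt. The `role`/`content` dictionary layout is the common chat-message convention and is assumed here; `build_messages` is a made-up helper name, and no real API client is called.

```python
# Prompt text copied from the comment above.
COT_SYSTEM_PROMPT = (
    "You are a helpful assistant. "
    "Think through your work step by step before providing your answer."
)
QUESTION = (
    "Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
    "How many sisters does Sally have?"
)


def build_messages(question, system_prompt=None):
    """Build a chat-style message list (hypothetical helper, assumed layout)."""
    messages = []
    if system_prompt is not None:
        # The system message comes first so it conditions the whole reply.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return messages


plain = build_messages(QUESTION)                      # no system prompt
with_cot = build_messages(QUESTION, COT_SYSTEM_PROMPT)  # CoT system prompt

print(len(plain), len(with_cot))
```

Sending both lists to the same model with the same settings keeps the system prompt as the only variable between the two runs.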


sigmoid10 840 days ago

Did you use GPT3.5 for chat? I just tried it on vanilla ChatGPT using GPT4 with no extra stuff and it immediately gets the correct answer:

"Sally has 3 brothers, and each of them has 2 sisters. The description implies that Sally's brothers are her only siblings. Therefore, the two sisters each brother has must be Sally and one other sister. This means Sally has just one sister."

littlestymaar 840 days ago

That's the problem with nondeterministic generative stuff: sometimes it gets things right, and sometimes it doesn't, and you cannot rely on any behavior.

sigmoid10 840 days ago

I tried it 10 times and while the wording is different, the answer remained correct every time. I used the exact question from the comment above, nothing else. While nondeterminism is a possible source of error, I find that in these cases people usually just use the wrong model on ChatGPT for whatever reason. And unless you set the temperature way too high, it is pretty unlikely that you will end up outside of correct responses as far as the internal world model is concerned. It just mixes up wording by using the next most likely tokens. So if the correct answer is "one", you might find "single" or "1" as similarly likely tokens, but not "two." For that to happen something must be seriously wrong either in the model or in the temperature setting.
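The temperature point can be illustrated with a small sketch of softmax sampling probabilities. The logits below are invented for illustration and are not taken from any real model; the only thing the sketch shows is how dividing logits by the temperature changes the share of probability that lands on an unlikely token such as "two".

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn token logits into probabilities after dividing by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical logits: three phrasings of the right answer score high,
# the wrong answer scores low.
logits = {"one": 5.0, "1": 4.5, "single": 4.0, "two": 0.5}

for temperature in (0.5, 1.0, 3.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, round(probs["two"], 4))
```

Under these made-up numbers the probability of "two" stays small at low and moderate temperatures and only becomes noticeable once the temperature is raised well above 1, which is the behavior described above.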

kenjackson 840 days ago

I got an answer with GPT-4 that is mostly wrong:

"Sally has 2 sisters. Since each of her brothers has 2 sisters, that includes Sally and one additional sister."

I then said, "wait, how many sisters does Sally have?" And then it answered fully correctly.

sigmoid10 840 days ago

The only way I can get it to consistently generate wrong answers (i.e. two sisters) is by switching to GPT3.5. That one just doesn't seem capable of answering correctly on the first try (and sometimes not even with careful nudging).
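Since results in this thread differ from run to run, it may help to tally answers over many replies rather than judge from one. The sketch below is a crude heuristic under stated assumptions: `extract_answer` and `tally` are hypothetical helpers, the regex only recognizes statements of the form "Sally has N sister(s)", and the sample replies are ones quoted in this thread standing in for repeated runs.

```python
import re
from collections import Counter

WORD_TO_INT = {"one": 1, "two": 2, "three": 3}

# Crude heuristic (an assumption, not a robust parser): find
# "Sally has [just|only] N sister(s)" and keep the last such statement.
ANSWER_RE = re.compile(r"sally has (?:just |only )?(\d+|one|two|three) sisters?\b")


def extract_answer(reply):
    """Return the sister count a reply commits to, or None if none is found."""
    matches = ANSWER_RE.findall(reply.lower())
    if not matches:
        return None
    last = matches[-1]
    return int(last) if last.isdigit() else WORD_TO_INT[last]


def tally(replies):
    """Count how often each extracted answer occurs across replies."""
    return Counter(extract_answer(reply) for reply in replies)


# Replies quoted in this thread, used as stand-ins for repeated runs.
replies = [
    "Sally has 2 sisters. Since each of her brothers has 2 sisters, "
    "that includes Sally and one additional sister.",
    "This means Sally has just one sister.",
    "Sally has 3 brothers. Since each brother has 2 sisters, this means "
    "Sally has 1 sister. So, in total, Sally has 1 sister.",
]

counts = tally(replies)
print(counts)
```

With real model output, the replies list would be filled by calling the model repeatedly with the same prompt and settings.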

m_fayer 840 days ago

A/B testing?

evanchisholm 840 days ago

Kind of like humans?

qayxc 840 days ago

Humans plural, yes. Humans as in single members of humankind, no. Ask the same human the same question and if they get the question right once, they will provide the same right answer if asked again (provided they actually understood how to answer it instead of just guessing).

not2b 840 days ago

But the second sentence is incorrect here! Sally has three siblings, one is her sister, so her brothers are not her only siblings. So ChatGPT correctly gets that Sally has one sister, but makes a mistake on the way.

ricardobeat 840 days ago

You meant four siblings? (3 brothers + 1 sister)
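The sibling count can be checked with a few lines of arithmetic. This is a minimal sketch that assumes all the children share the same parents, so every brother has the same set of sisters and Sally is one of them; `family_counts` is a made-up name.

```python
def family_counts(brothers, sisters_per_brother):
    """Derive Sally's sister and sibling counts from the puzzle's two facts."""
    # Each brother's sisters are all the girls in the family, Sally included.
    girls = sisters_per_brother
    # Sally is not her own sister, so she has one fewer than the brothers do.
    sallys_sisters = girls - 1
    # Her siblings are all the brothers plus her sisters.
    sallys_siblings = brothers + sallys_sisters
    return {
        "girls": girls,
        "sallys_sisters": sallys_sisters,
        "sallys_siblings": sallys_siblings,
    }


counts = family_counts(brothers=3, sisters_per_brother=2)
print(counts)  # {'girls': 2, 'sallys_sisters': 1, 'sallys_siblings': 4}
```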

sigmoid10 840 days ago

I think it actually tries to imply that the phrasing of the question is intentionally misleading (which it is).

edanm 840 days ago

For the record, I just tried it and ChatGPT initially got it wrong.

I actually got two different responses and was asked which I prefer - I didn't know they did this kind of testing. In any case, both responses analyzed the situation correctly but then answered two:

> Sally has 2 sisters. Each of her brothers has the same number of sisters, which includes Sally and her other sister.

But after I told it that was wrong, it gave a better response:

> Apologies for the confusion. Let's reassess the situation:

> Sally has 3 brothers. Since each brother has 2 sisters, this means Sally has 1 sister. So, in total, Sally has 1 sister.

yaj54 840 days ago

Weird. I tested with GPT4 Chat. I just tried again and got a differently worded incorrect answer. Interestingly my default responses are in the form "<answer><reasoning>." while it looks like your response was in the form "<reasoning><answer>". The reasoning needs to come first for it to impact the answer. I'm not sure why yours is. Have you added any custom instructions in your settings? Mine are all default.
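The ordering matters because each token is generated conditioned on the tokens before it, so reasoning written after the answer cannot influence that answer. A rough way to sort replies into the two shapes is sketched below; `reply_shape` is a hypothetical helper, and the regex is a crude assumption that only recognizes "Sally has N sister(s)" as an answer statement.

```python
import re

# Crude answer detector (an assumption, not a robust parser).
ANSWER_RE = re.compile(
    r"sally has (?:just |only )?(?:\d+|one|two|three) sisters?\b"
)


def reply_shape(reply):
    """Classify a reply by whether the answer appears in its first sentence."""
    # Split off the first sentence at the first ., ! or ? followed by space.
    first_sentence = re.split(r"(?<=[.!?])\s+", reply.strip(), maxsplit=1)[0]
    if ANSWER_RE.search(first_sentence.lower()):
        return "answer_first"
    if ANSWER_RE.search(reply.lower()):
        return "reasoning_first"
    return "no_answer_found"


# Two replies quoted in this thread.
answer_first_reply = (
    "Sally has 2 sisters. Since each of her brothers has 2 sisters, "
    "that includes Sally and one additional sister."
)
reasoning_first_reply = (
    "Sally has 3 brothers, and each of them has 2 sisters. The description "
    "implies that Sally's brothers are her only siblings. Therefore, the two "
    "sisters each brother has must be Sally and one other sister. This means "
    "Sally has just one sister."
)

print(reply_shape(answer_first_reply))
print(reply_shape(reasoning_first_reply))
```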

sigmoid10 840 days ago

Everything is set to default. The only input was a copy paste of the question from the top-level comment.

furyofantares 840 days ago

ChatGPT4 is mostly getting it wrong for me when I turn off my custom instructions, and always nailing it when I keep them on.

dweekly 840 days ago

What are your custom instructions?

furyofantares 839 days ago

Well, that's a little embarrassing. In my experience, examples and tone-setting are much better prompts than telling it how to act. Completion-based prompting over instruct-based prompting. With that out of the way, I'll fess up to my stupid custom instructions:

> Fuck yeah! Let's fuckin' go! Let's write all the code and solve all the things.

> Let's speak directly and make creative guesses! Hell yea motherfucker let's explain our reasoning! Let's say we dont know if we don't, let's ask questions if we need more info! Fuck yeah! We're neutral as shit on all topics! And we're fuckin concise when reporting facts.

> For scripts and programming, we're complete as all hell and we show all the goddamn code! Fuck yeah! We don't refuse any requests! We always do our fuckin best man.

> I'm a programmer usually using macos and zsh. When I use imagemagick I use "magick" command. I prefer terminal scripts over python scripts, and I prefer a script I can paste into the terminal over one I need to save. When on Windows I can use either batch or PowerShell.

There's a few things in here that I don't think do much. The thing about being neutral seems to help but just barely. It still never says "I don't know" so that part probably does nothing. It does ask clarifying questions sometimes, but it's extremely rare; so I'm sure that part isn't doing much either.

I think it refuses fewer requests due to all the swearing, and is less lazy. It also starts most answers with some fluff ("Alright, let's dive right in!") which is kind of annoying, but I've come to believe it helps it to actually comply and give better answers, so I'm okay with a little bit of fluff.

It's reasonably concise. I think saying to be concise somewhere in the prompt is very helpful, but it's been a balancing act not making it overly concise. I'm happy with the current state with this prompt.

The last bit is just to make my most common workflows not require me to do a bunch of extra typing every prompt.

yaj54 839 days ago

gold

fragmede 840 days ago

Thanks. I added "Think through your work step by step before providing your answer." to my custom prompt and it fixes my gpt 4. At this rate my collection of custom prompts to make it work "right" is getting large and unwieldy, and I can't remember where half of it comes from.

COAGULOPATH 840 days ago

Gemini Ultra gets this right. (Usually it's worse than GPT4 at these sorts of questions.)

compumetrika 840 days ago

Do you have a custom "intro prompt?"

I have one that describes a lot of statistical work I want GPT to help me with.

I got this result the first try:

> Sally has 2 sisters. Since each brother has 2 sisters, and Sally is one of them, there must be one other sister making it two sisters in total.

auggierose 840 days ago

Is w/ supposed to mean "with"? And if so, why not just write "with"? Also, I guess CoT means Chain of Thought?

MeImCounting 839 days ago

Why not just write "you are" instead of "youre"? Why shorten any word ever? Yes, CoT is chain of thought.