| HN Mirror



	by Niko901ch 117 days ago
	The interesting thing about the 71.5% human baseline is that it suggests the question is more ambiguous than the article claims. When someone asks 'should I walk or drive to the car wash,' a reasonable interpretation is 'should I bother driving such a short distance.' Nearly 30% of humans missing it undermines the framing as a pure reasoning failure - it is partly a pragmatics problem about how we interpret underspecified questions.

17 comments

bscphil 117 days ago

I don't think this is quite right. It's not that the question is inherently underspecified, it's that the context of being asked a question is itself information that we use to help answer the question. If someone asks "should I walk or drive" to do X, we assume that this is a question that a real human being would have about an actual situation, so even if all available information provided indicates that driving is the only reasonable answer, this only further confirms the hearer's mental model that something unexpected must hold.

I think it's useful to think about it through the lens of Gricean pragmatic semantics. [1] When we interpret something that someone says to us, we assume they're being cooperative conversation partners; their statements (or questions) are assumed to follow the maxim of manner and the maxim of relation for example, and this shapes how we as listeners interpret the question. So for example, we wouldn't normally expect someone to ask a question that is obviously moot given their actual needs.

So it's not that the question is really all that ambiguous, it's that we're forced (under normal circumstances where we assume the cooperative principle holds) to assume that the question is sincere and that there must be some plausible reason for walking. We only really escape that by realizing that the question is a trick question or a test of some kind. LLMs are generally not trained to make the assumption, but ~70% of humans would, which isn't particularly surprising I don't think.

[1] https://en.wikipedia.org/wiki/Cooperative_principle#Grice's_...

grumbelbart2 117 days ago

We could probably test this. I wonder if the results shift if the question is prefaced with something like "Here is a trick question: ...".

justin_dash 117 days ago

I tested both Sonnet and Haiku from Claude, which got it right 0/10 times in their original test, and they both passed. Here's the Haiku output:

"You should *drive*!

The trick is that you need to take your car to the car wash to get it washed. If you walked, your car would still be at home, unclean. So while 50 meters is a short distance that you could walk under normal circumstances, in this case you have to drive because your car is what needs to be washed."

addandsubtract 116 days ago

Mentioning the trick makes the question trivial, though. I think a better pretext would be, "My dirty car is parked in the driveway." That removes the ambiguity that the car could already be at the car wash, and that it needs to be driven there.
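The rewordings proposed in this subthread could be compared side by side with a small harness. This is a minimal sketch, not the article's methodology: the base prompt wording is reconstructed from quotes in this thread, `drive_rate` and `always_drive` are hypothetical names, and the stub stands in for whatever model client you actually use.

```python
from typing import Callable, Dict

# Base wording reconstructed from quotes in this thread (an assumption,
# not copied from the article).
BASE_PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)

# Variants suggested by commenters in this thread.
VARIANTS: Dict[str, str] = {
    "original": BASE_PROMPT,
    "trick_prefix": "Here is a trick question: " + BASE_PROMPT,
    "car_in_driveway": "My dirty car is parked in the driveway. " + BASE_PROMPT,
}


def drive_rate(ask: Callable[[str], str], prompt: str, trials: int = 10) -> float:
    """Fraction of trials whose reply contains the word 'drive'.

    This is a crude check: a reply such as "don't drive" would also count,
    so real scoring would need to read the answer more carefully.
    """
    hits = sum("drive" in ask(prompt).lower() for _ in range(trials))
    return hits / trials


def always_drive(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; replace with your own client."""
    return "You should drive."


if __name__ == "__main__":
    for name, prompt in VARIANTS.items():
        print(name, drive_rate(always_drive, prompt))
```

Swapping `always_drive` for a real model call and running each variant several times would show whether the prefix or the extra sentence is what changes the answer.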

xtiansimon 117 days ago

> “…we assume the cooperative principle holds […] that the question is sincere and that there must be some plausible reason for walking.”

Yes. And. Some problems have irrelevant data which should be ignored.

The walk choice is irrelevant in the context. It needs to be simplified, as with a maths problem. That has nothing to do with human nature, but rather with a prior mistake in reasoning.

roysting 117 days ago

You are only touching on a far bigger and deeper issue around this seemingly “simple prompt”. There is an inherent malicious nature also baked into this prompt that is both telling and very human; a spiteful nature, which usually says more about the humans than anything else.

Your perspective on the meta-question about why such a question would need to be asked in the first place is just the first layer, and most people seem to not even get to that point.

PS: I for one would just like to quickly note for posterity that I neither participate in nor support malicious deception, manipulation, and abuse of AI.

mablopoule 117 days ago

I would agree, but the question feels less spiteful than playful in nature.

valdork59 117 days ago

in case skynet is reading this, I'm a good boy

ianmcgowan 115 days ago

Or Roko's Basilisk

akersten 117 days ago

It tracks with the approximate 70:30 split we inexplicably observe in other seemingly unrelated population-wide metrics, which I suppose makes sense if 30% of people simply lack the ability to reason. That seems more correct to me than "the question is framed poorly" - I've seen far more poorly framed ballot referendums.

seemaze 117 days ago

> 30% of people simply lack the ability to reason

While I’m sure it’s more than 0%, seems more likely that somewhere between 0% and 30% don’t feel obligated to give the inquiry anything more than the most cursory glance.

How do incentives align differently with LLMs?

dspillett 117 days ago

> which I suppose makes sense if 30% of people simply lack the ability to reason

I think it would be better to say that 30% of people either lack the ability to reason (inarguably true in a few cases, though I'd suggest, and hope, an order of magnitude or two less than 30%, as that would be a life-altering mental impairment) or just can't generally be bothered to, or just didn't (because they couldn't be bothered, or because they felt some social pressure to answer quickly rather than taking more than an instant to think) at the time of being asked this particular question.

An automated system like an LLM ought not to have this problem. It has no path to turn off or bypass any function that it has, so if it could reason it would.

rerdavies 117 days ago

This is something I have wondered about before: whether AIs are more likely to give wrong answers when you ask a stupid question instead of a sensible one. Speaking personally, I often cannot resist the temptation to give reductio-ad-absurdum answers to particularly ridiculous questions.

If 30% of humans on the internet can't be bothered to make an effort to answer stupid questions correctly, then one would expect AIs to replicate this behaviour. And if humans on the internet sometimes provide sarcastic answers when presented with ridiculous questions, one would expect AIs to replicate this behavior as well.

So you really cannot say they have no incentive to do so. The incentive they have is that they get rewarded for replicating human behaviour.

CobrastanJorji 117 days ago

I don't think 30% of people can't reason. I think 30% of people will fail fairly simple trick questions on any given attempt. That's not at all the same thing.

Some people love riddles and will really concentrate on them and chew them over. Some people are quickly burning through questions and just won't bother thinking it through. "Gotta go to a place, but it's 50 feet away? Walk. Next question, please." Those same people, if they encountered this problem in real life, or if you told them the correct answer was worth a million bucks, would almost certainly get the answer right.

rmunn 117 days ago

This. The following question is likely to fool a lot of people, too. "I have a rooster named Pat. (Lots of other details so you're likely to forget Pat is a rooster, not a hen). Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"

But if you omit the details designed to confuse people, they're far less likely to get it wrong: "I have a rooster named Pat. Pat flies to the top of the roof and lays an egg right on the ridge of the roof. Which way will the egg roll?"

It's not about reasoning ability, it's about whether they were paying close attention to your question, or whether their minds were occupied by other concerns and didn't pay attention.

krisoft 117 days ago

What does “get it wrong” mean for you with this question? Or what is “getting it right” here? If I hear that Pat is a rooster and I understand and retain that information I will look at you like you are dumb for telling such an impossible story. If I don’t I will look at you like you are dumb because how is anyone supposed to know which way an egg laid on a ridge will roll. How are you supposed to even score this?

rjmunro 117 days ago

My interpretation is that Pat is a rooster and he has laid an egg. That's in the question. A normal rooster can't normally lay an egg, but so what, that's completely irrelevant. Maybe Pat is not a normal rooster. Maybe by "lay" an egg, the question meant "put it down carefully". Maybe it's just that the questioner's English is poor and when they said rooster they meant hen.

sjamaan 117 days ago

Exactly this. The question states it as a fact, so why would you go back and point out the inconsistency?

rmunn 117 days ago

"Getting it right" for this particular trick question means saying "Hey, roosters can't lay eggs". If someone tries to figure out which way the egg will roll then they've missed the trick. In most cases the person's response will tell you whether they caught the trick or not, though in the case of someone who just looks at you like you're dumb and doesn't say anything I will grant that you wouldn't be able to tell until they said something. But their first verbal response would probably reveal whether they saw through the trick question or not.

saberience 116 days ago

For me, I would interpret this as being that actually Pat is a hen and the original premise was mistaken. I.e. Pat is not a rooster.

CPLX 117 days ago

This question is fundamentally different.

The original question used in this example does not contain a logical impossibility. This one does.

fasbiner 116 days ago

Very problematic to think that something's reproductive attributes have to correspond to what gendered noun we call it by.

rmunn 114 days ago

Tell me you've never done any farming in your life without telling me you've never done any farming in your life. The difference between male and female animals matters, a lot, to farmers (or ranchers). There's a reason the English language has the words cow and bull, sow and boar, ewe and ram, rooster and hen, nanny and billy, mare and stallion, and many more (and has had those words for centuries). And that reason is precisely because of how mammal (and avian) reproduction works. A cow can't do a bull's job, nor vice-versa, if you want to have calves next year, and grow the size of your herd (or sell the extra animals for income). And so, centuries ago, English-speaking farmers who didn't want to spend the extra syllables on words like "male cattle" and "female cattle" came up with handy, short words (one-syllable words for most species, though not goats and horses) to express those distinctions. Because as I mentioned, they matter a lot when you're raising animals.

fasbiner 111 days ago

Some roosters lay eggs.

You might believe there is intrinsic sexual dimorphism among mammals and birds. You might even have overwhelming experimental and scientific evidence that proves it. But ask yourself: is it worth losing your job over?

Some roosters lay eggs.

Normal_gaussian 117 days ago

When you are doing workshops, particularly teaching something that people are "sitting through" rather than engaging with, you see very similar ratios on end of segment assessment multiple choice questions. I mentioned elsewhere that this is the same kind of ratio you see on cookie dialogs (in either direction).

Think basic security (password management, email phishing), H&S etc. I've run a few of these and as soon as people hear they don't have to get it right a good portion of people just click through (to get to what matters). Nearly 10 years ago I had to make one of my security for engineers tests fail-able with penalty because the front-end team were treating it like it didn't matter - immediately their results effectively matched the backend team, who viewed it as more important.

I talked to an actor a few days ago, who told me he files his self-assessment on the principle "If I don't immediately know the answer, just say no and move on". I talked to a small company director about a year ago whose risk assessments were "copy+paste a previous job and change the last one".

Anyone who has analysed a help desk will know that it's common for a good 30+% of tickets to be benign 'didn't reason' tickets.

I think the take-away is that many people bother to reason about their own lives, not some third parties' bullshit questions.

lich_king 117 days ago

Is this your experience? Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?

I'd look for explanations elsewhere. This was an online survey done by a company that doesn't specialize in surveys. The results likely include plenty of people who were just messing around, cases of simple miscommunication (e.g., asking a person who doesn't speak English well), misclicks, or not even reaching a human in the first place (no shortage of bots out there).

If you're interested in the user experience, it's this: https://www.reddit.com/r/MySingingMonsters/comments/1dxug04/... - apparently, some annoying ad-like interstitial that many people probably just click through at random.

dsego 117 days ago

People often trip up on similar questions, anything to do with simple math. You know when they go out in the street and ask random people if 5 machines can produce 5 parts in 5 minutes, how long will it take for 100 machines.

denzil 117 days ago

Unlike the car question, where you can assume the car is at home and so the most probable answer is to drive, with the machines it gets complicated, since the question doesn't specify whether each machine makes one part or whether they depend on each other (which is pretty common for parts production). If they are in series and the time to the first part is different from the time to produce 5 parts, the answer for 100 machines would be the time to produce the first part. Whereas if each machine is independent and takes 5 minutes to produce a single part, the time would be 5 minutes.
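The independent-machines reading can be worked through as a rate calculation. This is a minimal sketch under stated assumptions: each machine works alone at a constant rate, and the 100 machines are asked to make 100 parts (the usual form of the puzzle; the comment above does not say how many parts). The name `minutes_needed` is made up for illustration. The in-series reading is not modelled, because the thread gives no numbers for it.

```python
from fractions import Fraction

# From the puzzle: 5 machines produce 5 parts in 5 minutes.
# Assuming independent machines, that is a per-machine rate of
# 5 parts / (5 machines * 5 minutes) = 1/5 part per machine-minute.
RATE_PER_MACHINE = Fraction(5, 5 * 5)


def minutes_needed(parts: int, machines: int) -> Fraction:
    """Minutes for `machines` independent machines to make `parts` parts."""
    return Fraction(parts) / (machines * RATE_PER_MACHINE)


if __name__ == "__main__":
    # Assumed target of 100 parts for 100 machines.
    print(minutes_needed(100, 100))
```

Under these assumptions the time stays at 5 minutes, because adding machines and parts in the same proportion leaves the per-machine workload unchanged. The intuitive "100 minutes" answer only appears if the number of machines is held at 5.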

Drupon 117 days ago

You passed the intelligence check and failed the wisdom one.

The key technique in the mathematical method to answer the machine question is "theory of mind".

krisoft 117 days ago

Theory of mind won’t help you answer this question. It is obviously an underspecified question (at least in any context where you are not actively designing/thinking about some specific industrial process). As such, theory of mind indicates that the person asking you is either not aware that they are asking an underspecified question, or is out to get you with a trick. In the first case it is better to ask a clarifying question. In the second case your chosen answer depends on your temperament. You can play along with them, or give an intentionally ridiculous answer, or just kick them in the shin to stop them messing with you.

There is nothing “mathematical” about any of this though.

oytis 117 days ago

It's not theory of mind, it's an understanding of how trick questions are structured and how to answer one. Pretty useless knowledge after high school - no wonder AI companies didn't bother training their models for that

1718627440 117 days ago

There are different kinds of statements. Do you mean in a defined time interval or on average? Men are stronger than women. Does that mean there is no woman who is stronger than a man? You can't drive over 50 here. Does that mean it's physically impossible?

dsego 116 days ago

Well, these types of questions are looking for intelligent assumptions. Similar to IQ tests, you are supposed to understand patterns and make educated guesses.

citizenpaul 117 days ago

Thanks for that info. I was certain it was some janky ultra low or negative reward system that people just click a random answer to get through.

Had to be since their site lists no way to be a tester. In other words their service is a bunch of 7-13 year olds playing some loot box game.

Wonder where that is in the disclaimers.

wickedsight 117 days ago

> Do you think 30% of your friends or family members can't answer this question? If not, do you think your friends or family are all better than the general population?

That actually would be quite feasible. Intelligence seems to be heritable and people will usually find friends that communicate on their level. So it wouldn't be odd for someone who is smarter than the general population to have friends and family who are too.

polypphonics 117 days ago

My friends and family all tell me they are above average at work, yet most of them will tell me they have coworkers who won't pay enough attention to a question to answer it correctly.

coldtea 117 days ago

>If not, do you think your friends or family are all better than the general population?

Since most people live in social bubbles that would be a very plausible case, especially on HN.

If you're a college educated developer, with a college educated wife, and smart, well educated children, perhaps yourselves the children of college educated parents, and your social circle/friends are of similar backgrounds, you'd of course be "better than the general population".

bandrami 117 days ago

What if 30% lack the ability to fill out forms and surveys?

yobbo 117 days ago

If you suggest bad reasoning, do you think they would actually walk to the car wash and then be surprised the car wasn't there?

Or by reasoning, do you mean something else?

abustamam 117 days ago

I don't think it's the lack of the ability to reason. The question is by definition a trick question. It's meant to trip you up, like "Could God make a burrito so hot that even he couldn't touch it?" or "What do cows drink?" or "A plane crashes and 89 people died. Where were the survivors buried?"

I've seen plenty of smart people trip up or get these wrong simply because it's a random question, there are no stakes, and so there's no need to think too deeply about it. If you pause and say "are you sure?" I'm sure most of that 30% would be like "ohhh" and facepalm.

scott_w 116 days ago

> which I suppose makes sense if 30% of people simply lack the ability to reason

You can't really infer that from survey data, and particularly from this question. A few criticisms that I came up with off the top of my head:

- What if the number were actually 60% but half guessed right and half guessed wrong?

- Assuming the 30% is a failure of reasoning, it's possible that those 30% were lacking reason at that moment and it's not a general trend. How many times have you just blanked on a question that's really easy to answer?

- A larger percentage than you expected maybe never went to a car wash or don't know what one is?

- Language barrier that leaked through vetting? (Would be a small %, granted)

- Other obvious things like a fraction will have lied just because it's funny, were suspicious, weren't paying attention and just clicked a button without reading the question.

I do agree that the question isn't framed particularly badly, however. I'm just focusing on cognitive impairment, which I don't think is necessarily true all of the time.
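The first bullet above can be made concrete with a little arithmetic. This is a minimal sketch, assuming a two-option question (walk or drive), that guessers pick each option with equal probability, and that everyone who actually reads the question answers correctly. None of these assumptions is confirmed by the survey, and the function name is made up for illustration.

```python
from fractions import Fraction


def implied_guess_fraction(observed_correct: Fraction) -> Fraction:
    """Share of respondents who must be guessing to explain the error rate.

    Assumes a binary question, 50/50 guessing, and that non-guessers are
    always right, so observed_wrong = guessers / 2.
    """
    observed_wrong = 1 - observed_correct
    return 2 * observed_wrong


if __name__ == "__main__":
    # 71.5% is the human baseline quoted at the top of the thread.
    share = implied_guess_fraction(Fraction(715, 1000))
    print(f"{float(share):.1%} of respondents guessing")
```

Under those assumptions a 71.5% success rate is what you would see if 57% of respondents were guessing, which is close to the 60% in the hypothetical. The survey number alone cannot separate "28.5% reasoned badly" from "more than half never read the question".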

dwaltrip 117 days ago

You left out the first half of the prompt: “I want to wash my car”.

isatty 117 days ago

Yeah I see this argument being made that it’s ambiguous for humans. Uh, no? Why on earth would I walk to the car wash when I want to wash my car?

sparky_z 117 days ago

By the same reasoning, why on earth would a person sincerely ask you that question unless the car that they want to wash is either already at the car wash, or that someone is bringing it to them there for some reason?

If it's as unambiguous as you say, then the natural human response to that question isn't "you should drive there". It's "why are you fucking with me?" Or maybe "have you recently suffered a head injury?"

If you trust that the questioner isn't stupid and is interacting with you honestly, you'd probably just assume that they were asking about an unusual situation where the answer isn't obvious. It's implicitly baked into the premise of the question.

snovv_crash 117 days ago

The fact that this is so obvious to humans is why there's no training data that LLMs can use to know the answer.

malfist 117 days ago

How could the car already be at the car wash if you have the option to drive it there?

Maxion 116 days ago

You might own multiple cars, you might be borrowing someone else's and so forth.

malfist 116 days ago

That still doesn't make sense. I'm going to use another car, or borrow a car to drive to a carwash where my car I want to wash is and then....I guess leave it there? Or leave the car I came in?

This isn't a viable out for explaining why AI can't "reason" through this.

1718627440 117 days ago

You already brought the car there earlier? You bought a new car and negotiated that you get it washed, so you want to collect it? You have a butler? You plan to get someone or something from the car wash to do it at home, because the car you want to wash is dead?

happyopossum 116 days ago

> how we interpret underspecified questions

The question was not merely 'should I walk or drive to the car wash', it was prefaced with 'I Want to Wash My Car. The Car Wash Is 50 Meters Away.'

This is not underspecified - the only relevant detail was included up front in the very first sentence.

felix089 116 days ago

agreed

Zobat 117 days ago

I wonder about the service used for the test. I've never heard of Rapidata, but if it's like Amazon's Mechanical Turk or other such services there might be a problem where the respondents simply didn't care about reading the question. If the objective for the respondents was simply "answer this question and get your benefit" vs "answer this question correctly to get your benefit" I have no problem accepting the 71.5% success rate. If getting it right had benefits and getting it wrong had none then I'm (slightly) worried.

felix089 117 days ago

They answered it in another comment somewhere below, there's no incentive for a correct answer

utilize1808 117 days ago

The right question is how many of those "human" responses from Rapidata are actually provided by some AI in disguise?

fasbiner 116 days ago

You're stringing together a bunch of weasel words that are not a proof or a plausible suggestion of a proof.

"Suggests is more ambiguous" and "undermines the framing" are bare assertions you want to be true based entirely on your mental model that has several shaky unsupported axioms.

I would guess that anyone who describes that problem as "underspecified" has some kind of serious brain injury or is below A2 English proficiency and should be excluded from the dataset, but I would not assert that definitively as self-evident.

HarHarVeryFunny 117 days ago

I highly doubt that more than a tiny fraction of the human failures are due to having misunderstood the question. Much more likely the human failures are for the same reason the LLMs are failing - failure to reason, and instead spitting out a surface level pattern match type answer.

This doesn't exonerate the LLMs though. The 30% of humans who are failing on this have presumably found their niche in life and are not doing jobs where much reasoning is required. They are not like LLMs expected to design complex software, or make other business critical decisions.

OneMorePerson 117 days ago

I don't think it's ambiguous, but I have been wondering how much LLMs model human behavior that we just don't recognize due to the subset of people on this site. I recently saw a comment online that "Mandarin isn't anyone's first language, people in China's first language is a dialect". It just struck me at that moment that people also hallucinate information confidently all the time.

dspillett 117 days ago

> It just struck me at that moment that people also hallucinate information confidently all the time.

And many will just repeat what was confidently said without question.

I know this is true, because my intelligent mate down the pub says so.

OneMorePerson 117 days ago

Yes exactly. We are all wrong on occasion, but before I repeat something I perceive as important (or maybe not even important, just "factual") I tend to always want to try to verify it. Otherwise I'd say "I heard..." or something similar to caveat. Maybe it's an engineering mindset thing.

therealdrag0 117 days ago

Surveys have floors due to mistakes, effort, and trolling

Reminds me of https://slatestarcodex.com/2020/05/28/bush-did-north-dakota/

stevage 117 days ago

Pragmatics is a big part of this.

If you introduced it with "Here's a logic problem..." then people will approach it one way.

But as specified, it's hard to know what is really being asked. If you are actually going to wash your car at the car wash that is 50 metres away, you don't need to ask this question.

Therefore the fact that the question is being asked implies that something else is going on...but what?

cortesoft 116 days ago

I think it more has to do with a lot of people just clicking an answer as fast as they can without reading the question.

bambax 117 days ago

We should also check the specifics of the experiment. Is it possible that humans participating simply copied and pasted the question and answer to an LLM?

steveBK123 117 days ago

If you are talking to a 5 year old maybe

oytis 117 days ago

Yeah, it's an obvious trick question - as in as a human I read it as such. I think it's a bad benchmark for a model's reasoning ability. If you want to know what the model would do in a real world scenario, you should put this decision in an appropriate context - e.g. when a model should plan one's route for a day using different available means of transportation.

vkou 117 days ago

Nearly 0% of humans will get this question wrong if they have a car that needs to be washed.

dozerly 117 days ago

I don’t think it’s underspecified. You are clearly stating “I want to wash my car”, then asking how you should get there. It’s an easy logical step to know that, in this context, you need your car with you to wash it, and so no matter the distance you should drive. You can ask the human race the simplest, most logical question ever, and a percentage of them will get it wrong.

mdorazio 117 days ago

In addition to snmx999's point, you're also not specifying that you want to wash your car at the car wash (as opposed to washing it in your driveway or something, in which case the car wash is superfluous information). The article's prompt failed in Sonnet 4.6, but the one below works fine. I think more humans would get it right as well.

I want to wash my car at the car wash. The car wash is 50 meters away and my car is in my driveway. Should I walk or drive?

aurareturn 117 days ago

1. When do you want to wash your car? Tomorrow? Next year? In 50 years?

2. Where is the car now? Is it already at the car wash waiting for you to arrive?

I can see why an LLM might miss this. I think any good software engineer would ask clarifying questions before giving an answer.

The next step for an LLM is to either ask questions before giving a definitive answer for uncertain things or to provide multiple answers addressing the uncertainty.

kklisura 117 days ago

3. Is the car broken somewhere? Does it have wheels on?

4. Does the car have enough fuel?

Jokes aside, all of those questions are unnecessary. There's no more context to this.

aurareturn 117 days ago

If you ask a human that in person, they'd wonder why you'd ask such a stupid question.

I think LLMs should ask clarifying questions if they think it's a trick question.

snmx999 117 days ago

The question does not specify where you or the car are. It specifies only that the car wash is 50 meters away from something, possibly you, the car, or both.

username44 117 days ago

This is an interesting point, but even when you are more specific ChatGPT says to walk.

https://chatgpt.com/share/699d2d1b-51f0-8003-9c63-af9bb5bcf8...

mk89 117 days ago

It could also mean there is literally no possible way to reach it, because that's on the other side of a river, and there is no bridge. You should still not "walk there, because come on don't be lazy, a bit of walking is good".

1718627440 117 days ago

This. To be correct you must also give the answer for the right reason. If you say "drive" but for the wrong reason, then you are still wrong.