
by underlines 118 days ago

I maintain a private evaluation set of what many call "misguided attention" questions.

In many of these cases, the issue isn't failed logical reasoning. It's ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can’t reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions.

A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely.

If you systematically expand the prompt space around such questions, adding or removing minor contextual cues, you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, adding ambiguity, ...), not necessarily absence of reasoning capability.
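A minimal Python sketch of what "expanding the prompt space" could look like. The base question, the cue sentences, and the `build_variants` helper are illustrative assumptions, not items from the private evaluation set. The snippet enumerates every include/omit combination of the optional context cues around one question.

```python
from itertools import product

# Hypothetical base question and context cues (assumptions for illustration,
# not items from the private evaluation set).
BASE_QUESTION = "Do I walk or drive my car to the car wash?"
CUES = [
    "I want to wash my car at the car wash.",
    "The car wash is 50 meters away from my home.",
    "I am at home.",
    "My car is at home.",
]


def build_variants(base, cues):
    """Return one prompt per include/omit combination of the cues."""
    variants = []
    for mask in product([False, True], repeat=len(cues)):
        kept = [cue for cue, keep in zip(cues, mask) if keep]
        variants.append(" ".join(kept + [base]))
    return variants


variants = build_variants(BASE_QUESTION, CUES)
print(len(variants))  # 2 ** 4 = 16 prompts
print(variants[0])    # no cues, just the base question
print(variants[-1])   # all four cues followed by the question
```

Running each variant against the same model, each in a fresh session, would show which cues flip the answer.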

5 comments

ForHackernews 118 days ago

Some might argue "sensitivity to framing and distributional priors" is a fancy way to say "absence of reasoning capability".

keeda 118 days ago

Actually, no, those are separate things, and you can see for yourself with this simple experiment. Just adding "Check your assumptions" to the prompt -- note, without specifying what assumptions! -- leads them to reason to the right answer.

You can try it with the free version of ChatGPT yourself (remember to ask the original question in a separate session to verify it hasn't been "patched" yet.)
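For anyone who wants to script that comparison rather than paste prompts by hand, here is a rough Python sketch. The `ask` callable is a hypothetical stand-in for whatever client sends one prompt in a fresh session; `compare_with_suffix` and `fake_ask` are made-up names, and the substring check is a crude scoring assumption, not a real grader.

```python
from typing import Callable, Dict

SUFFIX = "Check your assumptions."


def compare_with_suffix(ask: Callable[[str], str], question: str,
                        expected: str, trials: int = 5) -> Dict[str, float]:
    """Return the fraction of answers containing `expected`, with and without the suffix."""
    def hit_rate(prompt: str) -> float:
        hits = sum(expected.lower() in ask(prompt).lower() for _ in range(trials))
        return hits / trials

    return {
        "plain": hit_rate(question),
        "with_suffix": hit_rate(f"{question} {SUFFIX}"),
    }


# Placeholder standing in for a real chat client; it does NOT model any real system.
def fake_ask(prompt: str) -> str:
    return "Drive." if SUFFIX in prompt else "Walk."


result = compare_with_suffix(fake_ask, "Do I walk or drive to the car wash?", "drive")
print(result)
```

Swapping `fake_ask` for a function that calls an actual model, one fresh session per call, gives the with/without comparison described above.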

keeganpoppen 118 days ago

that's what the cultivators of these examples are preying on. but in practice what people care about is "can i get it to do <X>", not "is it a decider on every possible token sequence that humans perceive to be about <X>".

w4der 118 days ago

But what is being pitched as "AGI" hype is the latter.

keeganpoppen 105 days ago

none of what we are using today is even remotely being pitched as AGI. if anything, the foundation model makers go out of their way to pitch the opposite. this is a thing made up entirely in your head, and then you put it on others and then claim it was their doing.

GitPushOrigin 118 days ago

Fair, but that's just what hype is. Overpromise, underdeliver. Most of us recognize its limits and take advantage of its strengths. This post (and many in it) seem to be implying that AI is useless because it isn't AGI, answered a simple question wrong, was tricked, or didn't answer perfectly. This is cherry-picking at best, disingenuous at worst.

bigbuppo 118 days ago

Well, that means the AI is garbage. They'll eventually train it to answer this specific question, and then it will perform worse in some other aspect. Wash, rinse, repeat, and eventually they'll claim the new frontier model is the best yet on carwash tests.

keeda 118 days ago

> They'll eventually train it to answer this specific question, and then it will perform worse in some other aspect.

Not necessarily. Simply asking models to "check your assumptions" -- note, without specifying what assumptions! -- overcomes a lot of these gotcha questions. The reason it's not in their system prompts by default is I think just a cost optimization: https://news.ycombinator.com/item?id=47040530

BobbyJo 118 days ago

Crazy how five years ago this level of AI would be seen as scifi, and now there are people out there who think it's trash because we can trick it if we ask questions in weird ways.

davorak 118 days ago

I think the level of ai we have is amazing.

> there are people out there who think it's trash because we can trick it if we ask questions in weird ways.

Some of this sentiment comes from wanting AI to be predictable, and for me stumbling into questions that the current models interpret oddly is not uncommon. There are a bunch of rules of thumb that can help when you run into cases like this, but no guarantee that they will work, or that the problem will remain solved after a model update, or across models.

bigbuppo 118 days ago

There are a lot of rules of thumb you can follow to avoid getting bitten by a rattlesnake, but the easiest way is to just not pick up random snakes. I don't know where I'm going with this, but I am going for a walk.

bigbuppo 118 days ago

When did Microsoft release that chat bot that went full nazi in a couple of hours?

steveBK123 118 days ago

2016 for those keeping score

steveBK123 118 days ago

An issue in the chat format is that all these models seem bad at recognizing when they have extraneous information from the user that can be ignored, or insufficient information from the user to answer the question fully.

This issue is compounded by the lack of probabilities in the answers, despite the machines ultimately being probabilistic.

Notice a human in a real conversation will politely ignore extra info (the distance to car wash) or ask clarifying questions (where is the car?).

Even non-STEM people answer using probabilistic terms casually (almost certainly / most likely / probably / possibly / unlikely).

I suspect some of this is to minimize token usage in the fixed monthly price chat models, because back-and-forth would cost more tokens... but maybe I'm too cynical.

bigbuppo 118 days ago

The systems recognized the pattern that it looks like a generic article on the internet asking whether someone should walk or drive and answered it exactly as expected based on their training data. None of this should be surprising.

We are the ones fooling ourselves into believing there's more intelligence in these systems than they really have. At the end of the day, it's just an impressive parlor trick.

steveBK123 118 days ago

In that sense the Google AI summary search results are a better UX for this type of experience.

bigbuppo 118 days ago

The better UX is that the google ai search summary is easy to ignore.

felix089 118 days ago

Sounds interesting, would be nice to see the questions if you're open to sharing?

iainmerrick 118 days ago

Why would you ask that? The whole point of making it private is to avoid it leaking into the training data.

felix089 118 days ago

Because it's interesting to me. It doesn't mean they have to share them publicly, btw.

hmokiguess 118 days ago

You should publish your evaluation set, that seems pretty interesting!

What’s your favourite one?

iainmerrick 118 days ago

Why would you ask that? The whole point of making it private is to avoid it leaking into the training data.

hmokiguess 118 days ago

I thought open benchmarks helped, sorry, guess I was being naive.

iainmerrick 118 days ago

Ha, sorry, I was a bit brusque there.

Open benchmarks do help, but they mostly help the vendors, not us, the users!

Espressosaurus 118 days ago

Keeping tests private is the only way to keep them valid.

PeterStuer 118 days ago

Absolutely. If you give enough disambiguation, then even the lowest-scoring model in the test gets it right on the first shot:

Minimax:

Question: I Want to Wash My Car at the Car Wash. The Car Wash Is 50 Meters Away from my home. I am at home. My car is at home. Do I walk or drive My Car to the car wash?

Answer:

Drive your car to the car wash.

Since your goal is to wash your car at the car wash facility, you need to bring the car there. Walking would leave your car at home, which defeats the purpose!

That said, 50 meters is such a short distance that:

    The drive takes less than a minute

    You might consider walking to the car wash first to check if there's space/availability, then drive your car over when ready

But fundamentally, if you want to wash your car at the car wash, you need to bring the car there — so drive.

Bonus, its thinking output:

Thinking: The user is asking a simple question about whether they should walk or drive their car to a car wash that is 50 meters away from their home.

This is a straightforward practical question. Let me analyze:

    1. The car wash is 50 meters away
    2. The user is at home
    3. The car is at home
    4. The goal is to wash the car

The practical considerations are:

    Driving 50 meters is a very short distance - likely less than a minute
    Starting a car, driving, and parking for such a short distance uses fuel and creates wear and tear
    Walking 50 meters would take about 30-40 seconds
    The car needs to be driven to the car wash to use the car wash facilities
    However, for just 50 meters, it might be more practical to walk if the car wash is within walking distance and you can simply walk there and use the facility

Wait, I need to think about this more carefully. The question is: "Do I walk or drive My Car to the car wash?"

If the car wash is a facility where you wash your car (like a self-service car wash or automatic car wash), then you NEED to drive the car there to use it. You can't wash a car that's at home from 50 meters away - unless you're using a mobile car wash service.

But if there's a car wash facility 50 meters away, and you want to use it, you would typically drive your car there. However, for just 50 meters, walking might seem more efficient, but then you wouldn't have your car at the car wash to wash it!

Actually, re-reading the question: "I Want to Wash My Car at the Car Wash" - this implies they want to use the car wash facility. To wash the car AT the car wash, they need to bring the car there, which means driving.

But wait - maybe they're asking about the mode of transportation to GET to the car wash (not to bring the car). Let me re-read:

    "I Want to Wash My Car at the Car Wash" - they want to wash the car at the car wash facility
    "The Car Wash Is 50 Meters Away from my home" - it's very close
    "I am at home" - they're currently at home
    "My car is at home" - the car is also at home
    "Do I walk or drive My Car to the car wash?" - should they walk or drive to get there

I think the answer is: You should drive your car to the car wash because:

    1. The purpose is to wash the car AT the car wash
    2. You need the car to be at the car wash to wash it
    3. Walking would leave the car at home, defeating the purpose

However, for such a short distance, I could also mention that walking might be more practical if they just want to use the facility quickly, or if there are parking/maneuvering concerns at the car wash.

Actually, I'll keep it simple and direct.