by bearjaws 482 days ago
I would argue almost every popular benchmark quoted by the big LLM companies is tainted.

OAI, xAI, Anthropic, Google all score incredibly well, then you go to try and write code and it's just okay.

They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.

4 comments

>They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.

Not sure that's really the claim. I think they claim that performance on benchmarks like GPQA indicates PhD-level knowledge of different fields.

That is the message; it's just rarely stated in such a succinct and deniable way.
Yeah, that's true in many fields with these AI agents. They demo well, but when you put them to actual work they fall flat on their faces. Even worse, the harder the task you set for them, the more they lie to you. It's like hiring a junior dev from one of those highly regimented societies where it's more important to save face than to get the job done.
It's almost as if they're not trying to market to the people actually using the products, but trying to convince investors of features that don't exist.
Yep, it's "full self driving in 1 year" all over again.
It's the good old Elon Musk playbook spread out across the industry.
Someone should coin a term for this very new phenomenon. Maybe “vaporware”?
Your last sentence feels kind of spot on. The lack of transparency around confidence in the answer makes it hard to use (and I know it would not be simple to add such a thing).
Sounds like a skill issue, to be honest. You could probably tell the assistant to just ask you questions when information is missing instead.
Have you actually tried this? What happens is it will very often ask you questions at irrelevant times so you start ignoring the questions and it becomes wasted space.

Even OpenAI hasn't figured it out, because their Deep Research always asks questions before starting the search.

Programming is easy. Asking the right question is hard.

People don't know what questions to ask.

But it doesn't know when information is missing.
To be totally fair, using "PhD" as a barometer of anything without specifying a PhD in what is like claiming that LLMs have encyclopedic knowledge while meaning a children's encyclopedia.
The popular benchmarks are the ones that have already leaked. Think about it.