
by bearjaws 482 days ago

I would argue almost every popular benchmark quoted by the big LLM companies is tainted.

OAI, xAI, Anthropic, Google all score incredibly well, then you go to try and write code and it's just okay.

They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.

4 comments

vonneumannstan 482 days ago

>They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.

Not sure that's really the claim. I think they claim that performance on benchmarks like GPQA indicates PhD-level knowledge of different fields.

AyyEye 482 days ago

That is the message, it's never usually stated in such a succinct and deniable way.

jandrese 482 days ago

Yeah, that's true in many fields with these AI agents. They demo well, but when you put them to actual work they fall flat on their faces. Even worse, the harder the task you set for them, the more they lie to you. It's like hiring a junior dev from one of those highly regimented societies where it's more important to save face than to get the job done.

Xelynega 482 days ago

It's almost as if they're not trying to market to the people actually using the products, but trying to convince investors of features that don't exist.

alfalfasprout 482 days ago

Yep, it's "full self driving in 1 year" all over again.

ilrwbwrkhv 482 days ago

It's the good old Elon Musk playbook spread out across the industry.

brookst 482 days ago

Someone should coin a term for this very new phenomenon. Maybe “vaporware”?

aprilthird2021 482 days ago

Your last sentence feels kind of spot on. The lack of transparency around confidence in the answer makes it hard to use (and I know it would not be simple to add such a thing).

hackernewds 482 days ago

Sounds like a skill issue, to be honest. You could probably tell the assistant to just ask you questions when information is missing instead.

dimitri-vs 481 days ago

Have you actually tried this? What happens is that it will very often ask you questions at irrelevant times, so you start ignoring the questions and they become wasted space.

Even OpenAI hasn't figured it out, because their Deep Research always asks questions before starting the search.

ryoshu 482 days ago

Programming is easy. Asking the right question is hard.

People don't know what questions to ask.

aprilthird2021 482 days ago

But it doesn't know when information is missing.

washadjeffmad 482 days ago

To be totally fair, using "PhD" as a barometer of anything, without specifying a PhD in what, is like claiming that LLMs have encyclopedic knowledge while meaning a children's encyclopedia.

hackernewds 482 days ago

The popular benchmarks are the ones that have already leaked. Think about it.
