by wongarsu 2 hours ago
It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and penalizes them for trying to bullshit their way through.

It took me a while to figure out how to interpret the benchmark correctly, because on the overview page it says "AA-Omniscience Non-Hallucination Rate," but on the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn...

it says "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
Yes. Most benchmarks just measure how many answers are correct. The best way to optimize for that is to confidently state something in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
They are, especially multiple-choice questions. The same happens with human exams:

Let's say there are 100 questions, with 4 answers each. A correct answer is worth 1 point. By just guessing you get an average of 25/100, way more than the 0/100 you get by not replying.

If instead a wrong answer is -1 point, by just guessing you get on average about 25 right and 75 wrong, i.e. 25 - 75 = -50/100, way worse than the 0/100 you get by not replying.
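
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes uniform random guessing, +1 for a correct answer and 0 for a skipped question; the function name and the `wrong_penalty` parameter are made up for illustration and do not come from any benchmark's actual scoring code.

```python
from fractions import Fraction


def expected_guess_score(n_questions, n_choices, wrong_penalty):
    """Expected total score when every answer is a uniform random guess.

    Assumed scoring (hypothetical): +1 for a correct answer,
    -wrong_penalty for a wrong one, 0 for a skipped question.
    """
    p_correct = Fraction(1, n_choices)
    per_question = p_correct - (1 - p_correct) * wrong_penalty
    return n_questions * per_question


print(expected_guess_score(100, 4, 0))               # 25: guessing beats skipping
print(expected_guess_score(100, 4, 1))               # -50: guessing loses to skipping
print(expected_guess_score(100, 4, Fraction(1, 3)))  # 0: guessing and skipping tie
```

With 4 choices, a penalty of 1/3 point per wrong answer is the break-even: anything above it makes blind guessing worse than not replying at all.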