HN Mirror


by moffkalast 45 days ago

It's a benchmark and eval issue. Guessing gets them the right result sometimes, so the models rank better on error rate than they otherwise would. We need the kind of benchmarks that penalize being wrong WAY more than saying "I don't know".

Of course there's a secondary problem: the model may then overuse the unintelligible option. But that's a matter of training it properly against that eval.
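To make the scoring idea concrete, here's a rough sketch of an abstention-aware metric. The outcome labels, the function names and the penalty of 3 are all made up for illustration; this isn't how any particular benchmark scores things.

```python
def accuracy(outcomes):
    """Plain accuracy: abstaining counts the same as being wrong."""
    outcomes = list(outcomes)
    return sum(o == "correct" for o in outcomes) / len(outcomes)


def penalized_score(outcomes, wrong_penalty=3.0):
    """Mean score where a wrong answer costs `wrong_penalty` points,
    an abstention ("I don't know") costs nothing, and a correct answer earns 1.
    The penalty value is an arbitrary assumption, not taken from a real benchmark.
    """
    points = {"correct": 1.0, "abstain": 0.0, "wrong": -wrong_penalty}
    outcomes = list(outcomes)
    return sum(points[o] for o in outcomes) / len(outcomes)


def break_even_confidence(wrong_penalty=3.0):
    """Guessing has positive expected score only if p*1 - (1-p)*penalty > 0,
    i.e. only if the model's chance of being right exceeds this value."""
    return wrong_penalty / (1.0 + wrong_penalty)


# Three hypothetical models answering 100 questions each.
guesser = ["correct"] * 40 + ["wrong"] * 60     # always answers
cautious = ["correct"] * 30 + ["abstain"] * 70  # answers only when sure
refuser = ["abstain"] * 100                     # overuses "I don't know"
```

With these made-up numbers, plain accuracy ranks the guesser first (0.40 vs 0.30 vs 0.00), while the penalized score ranks it last (-1.40 vs 0.30 vs 0.00). The always-abstain model still loses to the cautious one, which is the pressure you'd want against overusing the opt-out. With a penalty of 3, guessing only pays off above 75% confidence.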

You could also try thresholding the output on perplexity to drop the parts the model is less sure about, though I don't think that would be very accurate.
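For what it's worth, the perplexity thresholding could look roughly like this. It assumes you can get per-token log-probabilities from whatever model you're running and that the output is already split into segments; the threshold of 2.0, the `[unsure]` marker and the function names are placeholders I picked, not tuned or standard values.

```python
import math


def perplexity(logprobs):
    """Perplexity of a segment: exp of the negative mean token log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))


def mask_uncertain(segments, max_ppl=2.0, marker="[unsure]"):
    """Replace every segment whose perplexity exceeds `max_ppl` with `marker`.

    `segments` is a list of segments, each a list of (token_text, logprob)
    pairs. Empty segments are skipped.
    """
    out = []
    for seg in segments:
        if not seg:
            continue
        ppl = perplexity([lp for _, lp in seg])
        text = "".join(tok for tok, _ in seg)
        out.append(text if ppl <= max_ppl else marker)
    return " ".join(out)


# Hypothetical output: one confident segment, one shaky one.
example = [
    [("The", -0.05), (" cat", -0.10)],   # perplexity is about 1.08
    [("zorb", -2.5), ("ulon", -1.8)],    # perplexity is about 8.6
]
masked = mask_uncertain(example)  # "The cat [unsure]"
```

The catch is that low perplexity only means the model found the text likely. It doesn't mean the text is right, so a confidently wrong segment sails straight through.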

2 comments

flumes_whims_ 44 days ago

Benchmarks that reward "I don't know" over a wrong answer seem like the right way to steer the industry towards models that are good at this. AA-Omniscience is one such benchmark.

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains. The benchmark contains 6,000 questions across 6 major domains, derived from authoritative academic and industry sources and generated automatically using an LLM-based question generation agent to ensure unambiguity, scalability and factual precision.

https://artificialanalysis.ai/evaluations/omniscience

user_7832 45 days ago

Yeah, I broadly agree with you. I've tried explicitly adding a prompt to "ask questions and clarify", and even fairly decent models like Gemini Pro (2.5 or 3) tend to ask questions for the sake of it.

Which reminds me of another big issue with LLMs: they'll blindly do whatever you ask them to, without pushback. (Again, I miss 3.5/3.6-era Sonnet, which actually had half a spine. Fuck Anthropic for blindly chasing coding benchmarks at the cost of everything else.)

I've engaged in several "CMVs" (or "tell me why X is bad") with LLMs, and very often it's clear the model is just saying stuff to say it, giving terrible points on unjustifiable positions that collapse the moment I counter-argue even slightly rationally.