by flumes_whims_ 30 days ago
Benchmarking models on whether they say "I don't know" instead of giving a wrong answer seems like the right way to steer the industry toward models that are good at this. AA-Omniscience is one such benchmark.

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains. The benchmark contains 6,000 questions across 6 major domains, derived from authoritative academic and industry sources and generated automatically using an LLM-based question generation agent to ensure unambiguity, scalability and factual precision.

https://artificialanalysis.ai/evaluations/omniscience
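
To make "rewards accuracy, punishes bad guesses" concrete, here is a minimal sketch of an abstention-aware score. The rule is my assumption for illustration (+1 for a correct answer, -1 for a wrong one, 0 for "I don't know", scaled to the range -100 to 100), and the function name and outcome labels are hypothetical. I'm not claiming this is the exact formula the benchmark uses.

```python
from collections import Counter

# Hypothetical outcome labels for graded answers (not taken from the benchmark).
VALID_OUTCOMES = {"correct", "incorrect", "abstain"}


def abstention_aware_score(outcomes):
    """Return a score in [-100, 100].

    Assumed rule: +1 per correct answer, -1 per incorrect answer,
    0 per abstention, averaged over all questions and scaled by 100.
    """
    if not outcomes:
        raise ValueError("need at least one graded answer")
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    return 100.0 * (counts["correct"] - counts["incorrect"]) / len(outcomes)


# Two made-up models that both know half the answers.
# The guesser answers everything; the cautious one abstains when unsure.
guesser = ["correct"] * 50 + ["incorrect"] * 50
cautious = ["correct"] * 50 + ["abstain"] * 50

print(abstention_aware_score(guesser))   # 0.0
print(abstention_aware_score(cautious))  # 50.0
```

Under plain accuracy both models would score 50%, so a rule of this kind is what separates a model that knows when it doesn't know from one that guesses.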