|
|
|
|
|
by moffkalast
45 days ago
|
|
It's a benchmark and eval issue. Guessing gets them the right result sometimes and the models rank better in error rate than they'd otherwise. We need the kind of benchmarks that penalize being wrong WAY more than saying "I don't know". Of course there's a secondary problem that the model may then overuse the unintelligible option, but that's something that's a matter of training them properly against that eval. You could also try thresholding the output based on perplexity to remove the parts that the model is less sure about, but that's not going to be super accurate I think. |
|
AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains. The benchmark contains 6,000 questions across 6 major domains, derived from authoritative academic and industry sources and generated automatically using an LLM-based question generation agent to ensure unambiguity, scalability and factual precision
https://artificialanalysis.ai/evaluations/omniscience