


	by jpkw 252 days ago
	Hoping someone here knows the answer to this: do any of the current benchmarks account for false answers in a meaningful way, beyond how a typical test would (i.e., giving any answer at all beats saying "I don't know," since an answer at least has a chance of being correct, which is bad in the real world)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful to me than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.

3 comments

terandle 252 days ago

https://artificialanalysis.ai/evaluations/omniscience


rocqua 252 days ago

Those numbers are too good to expect. If 90% right, 10% wrong is the baseline, which of these would you take as an improvement?

- 80% right, 18% "I don't know", 2% wrong
- 50%/48%/2%
- 10%/90%/0%
- 80%/15%/5%

The general point is that, if the change comes only from trade-offs, reducing wrong answers means accepting some reduction in right answers. Otherwise you are just saying "I'd like a better system," which is rather obvious.

Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
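
To make the trade-off concrete, here is a minimal sketch that ranks the mixes above under a penalty for wrong answers. The scoring rule (+1 per right answer, 0 for "I don't know", minus `wrong_penalty` per wrong answer) and every name in the code are assumptions for illustration, not how any benchmark mentioned in this thread actually scores.

```python
# Assumed scoring rule (hypothetical, not taken from any real benchmark):
# +1 per right answer, 0 per "I don't know", -wrong_penalty per wrong answer.
def score(right, idk, wrong, wrong_penalty):
    return right - wrong_penalty * wrong

# (right %, "I don't know" %, wrong %) for the mixes discussed in the thread.
candidates = {
    "90/0/10 (baseline)": (90, 0, 10),
    "80/18/2": (80, 18, 2),
    "50/48/2": (50, 48, 2),
    "10/90/0": (10, 90, 0),
    "80/15/5": (80, 15, 5),
    "70/27/3": (70, 27, 3),
}

def rank(wrong_penalty):
    """Return candidate names sorted from best to worst score."""
    return sorted(
        candidates,
        key=lambda name: score(*candidates[name], wrong_penalty),
        reverse=True,
    )

def break_even_penalty(a, b):
    """Penalty at which mixes a and b score the same (None if wrong rates match)."""
    (right_a, _, wrong_a), (right_b, _, wrong_b) = a, b
    if wrong_a == wrong_b:
        return None
    return (right_a - right_b) / (wrong_a - wrong_b)

if __name__ == "__main__":
    for penalty in (1, 2, 9):
        print(penalty, rank(penalty))
    print(break_even_penalty(candidates["90/0/10 (baseline)"], candidates["10/90/0"]))
```

Under this assumed rule, preferring 10/90/0 over the 90/10 baseline only makes sense if a wrong answer costs more than 8 right answers, while a penalty of 2 is already enough for 80/18/2 to beat the baseline.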


fwip 252 days ago

I think you may have misread. They stated that they'd be willing to go from 90% correct to 10% correct for this tradeoff.


rocqua 252 days ago

Thanks for the correction


energy123 252 days ago

OpenAI uses SimpleQA to assess hallucinations.
