by ofermend 558 days ago
Gemini-2.0-Flash does extremely well on the Hallucination Evaluation Leaderboard, at 1.3% hallucination rate https://github.com/vectara/hallucination-leaderboard
2 comments

Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false, which is worse than the 50% you'd expect from guessing, and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. So a 1.3% hallucination rate is a great outcome.
Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there, in the negative sense, compared with Anthropic. That benchmark also deals with hallucination and "inappropriate certainty".