


	by ofermend 558 days ago
	Gemini-2.0-Flash does extremely well on the Hallucination Evaluation Leaderboard, at 1.3% hallucination rate https://github.com/vectara/hallucination-leaderboard

2 comments

refulgentis 558 days ago

Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.


jug 557 days ago

Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there in the negative sense vs Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".
