HN Mirror



	by zone411 384 days ago
	If anyone is interested in a larger sample size comparing how often LLMs confabulate answers based on provided texts, I have a benchmark at https://github.com/lechmazur/confabulations/. It's always interesting to test new models with it because the results can be unintuitive compared to those from my other benchmarks.

1 comment

dr_kiszonka 384 days ago

Useful benchmark. I noticed o3-high hallucinating too often for such a good model, though it is usually great with search. In my experience, Claude Opus & Sonnet 4 consistently lie, cheat, and try to hide their tracks. Maybe they are good at writing code, but I don't trust them with other things.
