| HN Mirror

	by dr_kiszonka 384 days ago
	Useful benchmark. I noticed o3-high hallucinating too often for such a good model, but it is usually great with search. In my experience, Claude Opus & Sonnet 4 consistently lie, cheat, and try to hide their tracks. Maybe they are good at writing code, but I don't trust them with other things.