Hacker News new | ask | show | jobs
by jug 558 days ago
Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there in the negative sense vs Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".