| HN Mirror



	by jug 558 days ago
	Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is a negative outlier there compared with Anthropic. That benchmark also deals with hallucination and "inappropriate certainty".