Hacker News
by jug 558 days ago
Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is an outlier there in the negative sense vs Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".