by gsandahl 310 days ago
We are running task-specific benchmarks across a number of categories (agentic tasks, context tasks, normalization tasks, etc.), and on our benchmarks we see GPT-5 rating slightly below o3, but at a much lower cost.

See https://opper.ai/models

1 comment

Most of the tasks are assessed against ground truth, occasionally with an LLM as a judge to assess the answer when it is a sentence rather than an exact result.

Example: Given a long travel journal, "How many cities does the author mention?" GPT-5: 12. Expected: 17.
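
For anyone curious what that kind of grading could look like, here is a rough sketch, not our actual harness. The names `normalize`, `grade`, and the `judge` callback are all hypothetical, and the rule "use the judge only when the expected answer is more than one word" is an assumption made for illustration.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.strip().lower().split())


def grade(
    answer: str,
    expected: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the answer is accepted.

    Exact results are compared against ground truth. If the expected answer
    is a multi-word sentence and a judge is supplied, the judge decides.
    `judge` is a hypothetical callback standing in for an LLM-as-judge call.
    """
    if normalize(answer) == normalize(expected):
        return True
    # Assumption: a multi-word expected answer is "a sentence", so an exact
    # string match is too strict and the judge gets the final say.
    if judge is not None and len(expected.split()) > 1:
        return judge(answer, expected)
    return False


# The travel-journal example: an exact numeric result, so no judge is involved.
print(grade("12", "17"))
```

With an exact expected value like 17, the judge is never consulted, so a wrong count such as 12 simply fails.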