| HN Mirror



	by gsandahl 357 days ago
	We are running task-specific benchmarks across a number of categories (agentic tasks, context tasks, normalization tasks, etc.), and on our benchmarks we see GPT-5 rating slightly below o3, but at a much lower cost. See https://opper.ai/models

1 comment

gsandahl 357 days ago

Most of the tasks are assessed against ground truth, occasionally with help from an LLM as a judge when the answer is a sentence rather than an exact result.

Example: Given a long travel journal, how many cities does the author mention?
GPT-5: 12
Expected: 17
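
For anyone curious what that kind of scoring could look like, here is a rough sketch in Python. It is not our actual harness: the names `normalize`, `score`, and `judge` are made up for illustration, the judge is just a callable you would back with an LLM call of your own, and treating "expected answer has more than one word" as "the answer is a sentence" is a simplifying assumption.

```python
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def score(
    answer: str,
    expected: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Return True if the answer counts as correct (hypothetical scoring rule)."""
    # Exact results (numbers, labels) are compared directly against ground truth.
    if normalize(answer) == normalize(expected):
        return True
    # Assumption: a multi-word expected answer is a "sentence", so a judge
    # (if one is supplied) decides whether the two answers agree.
    if judge is not None and len(expected.split()) > 1:
        return judge(answer, expected)
    return False


# The travel-journal example: an exact-result task, so no judge is consulted.
is_correct = score("12", "17")
```

With this rule the travel-journal example is a plain miss, since 12 and 17 are exact results and the judge never gets involved.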
