

by hypoxia 543 days ago

Many are incorrectly citing 85% as human-level performance.

85% is just the (semi-arbitrary) threshold for winning the prize.

o3 actually beats the human average by a wide margin on the Public Eval set: 64.2% for humans vs. 82.8%+ for o3.

...

Here's the full breakdown by dataset, since none of the articles make it clear:

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]
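
To make the like-for-like comparison concrete, here is a minimal sketch that computes how far each o3 configuration sits above the human average on the Public Eval set, the only dataset listed with both a human and an o3 score. The numbers are copied from the breakdown above; the dictionary, the function name, and the rounding to one decimal place are my own assumptions for illustration.

```python
# Public Eval scores (percent), copied from the breakdown above.
# Variable and function names are illustrative, not from any official source.
PUBLIC_EVAL = {
    "o3 (unlimited compute)": 91.5,
    "o3 (limited compute)": 82.8,
    "human average (Mechanical Turk)": 64.2,
}

HUMAN_KEY = "human average (Mechanical Turk)"


def margin_over_humans(scores, baseline_key=HUMAN_KEY):
    """Return each non-baseline entry's lead over the baseline, in percentage points."""
    baseline = scores[baseline_key]
    return {
        name: round(score - baseline, 1)  # round away floating-point noise
        for name, score in scores.items()
        if name != baseline_key
    }


margins = margin_over_humans(PUBLIC_EVAL)
for name, points in margins.items():
    print(f"{name}: +{points} points over the human average")
```

The 85% prize threshold is deliberately left out of the comparison: it applies to the Private Eval set, where no human or o3 score is listed here.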

...

References:

1 comment

If my life depended on the average rando solving 8/10 arc-prize puzzles, I'd consider myself dead.