Many are incorrectly citing 85% as human-level performance. 85% is just the (semi-arbitrary) threshold for winning the prize. o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3 (both on the Public Eval set).

...

Here's the full breakdown by dataset, since none of the articles make it clear --

Private Eval:
- 85%: threshold for winning the prize [1]

Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]

Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:
[1] https://arcprize.org/guide
[2] https://arcprize.org/blog/oai-o3-pub-breakthrough
[3] https://arxiv.org/abs/2409.01374
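
If you want to check the Public Eval gap yourself, here is a rough Python sketch. Public Eval is the only set in the breakdown that has both a human and an o3 number. The scores are copied from the figures quoted above; the dict layout and the `margin_over_humans` helper are my own illustrative names, not anything from the ARC Prize materials.

```python
# Public Eval scores (percent) as quoted in the breakdown above.
# The dict layout is an assumption made for illustration only.
HUMAN = "human average (Mechanical Turk)"

PUBLIC_EVAL = {
    "o3 (unlimited compute)": 91.5,
    "o3 (limited compute)": 82.8,
    HUMAN: 64.2,
}


def margin_over_humans(scores, baseline_key=HUMAN):
    """Return each non-baseline entry's lead over the baseline, in percentage points."""
    baseline = scores[baseline_key]
    return {
        name: round(score - baseline, 1)  # round away floating-point noise
        for name, score in scores.items()
        if name != baseline_key
    }


margins = margin_over_humans(PUBLIC_EVAL)
for name, points in margins.items():
    print(f"{name}: +{points} points over the human average")
```

The margins are in percentage points, not relative improvement, and they only compare numbers measured on the same dataset. Putting the 85% Private Eval prize threshold into the same comparison would mix datasets, which is the confusion the comment is pointing out.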