Human performance is 85% [1]. o3 high gets 87.5%. This means we have an algorithm to get to human-level performance on this task. If you think this task is an eval of general reasoning ability, we have an algorithm for that now. There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works. Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
But, still, this is incredibly impressive.