| There is still plenty of room for growth on the ARC-AGI benchmarks. o3-pro still scores below 5% on ARC-AGI-2, and o3-pro-high reaches only 59% on ARC-AGI-1: "ARC-AGI-1:
* Low: 44%, $1.64/task
* Medium: 57%, $3.18/task
* High: 59%, $4.16/task
ARC-AGI-2:
* All reasoning efforts: <5%, $4-7/task
Takeaways:
* o3-pro in line with o3 performance
* o3's new price sets the ARC-AGI-1 Frontier" - https://x.com/arcprize/status/1932535378080395332 |
Given that the models don’t even see the versions we get to see, it doesn’t surprise me that they have issues with these. It’s not hard to make benchmarks so hard that humans and LLMs can’t do them.