Hacker News
by falcor84 620 days ago
... and such that the same increase in problem complexity requires a smaller increase in human effort to solve.

This was the idea behind the Winograd schema challenge [0] and now the ARC benchmark [1], but human-level performance on the former was achieved in 2019, and very strong progress has been made on the latter over the last few months. At this point, it seems we're pretty much reaching the limit of challenges that are relatively easy for humans to solve in a single sitting, and we'll have to start switching to benchmarks that rely on extensive work over time, such as SWE-bench [2]. Even there, it seems that state-of-the-art AI agents are already doing better than the "average" human developer.

[0] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[1] https://arcprize.org/

[2] https://www.swebench.com/