by atleastoptimal 299 days ago
I'm referring to METR's long-horizon task benchmark, where the length of tasks models can complete has been growing exponentially since GPT-2:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...