Hacker News
by balefulboy 10 days ago
METR's time horizon is not a reliable metric of LLM capability growth: https://www.transformernews.ai/p/against-the-metr-graph-codi...
2 comments

Yes, I've seen this before, and while the critiques are fair and high quality (and unfortunately not unique to METR), we're missing the forest for the trees here.

First of all, if you take the article's critiques and work out their implications for the METR graph, all you're doing is shifting the curve up or down; it doesn't change the fact that progress is exponential. While it is technically possible that the universe is throwing a massive pathological curveball that would overturn the conclusion from the METR data (which is that we've seen exponential growth over the last 6 years), that seems very unlikely. The fact that we see the same behavior from a variety of sources over a wide variety of tasks and domains is a pretty clear indication that METR, while certainly far from perfect, is painting a consistent picture, at least in terms of the rate of progress.
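To make the "shifting the curve up or down" point concrete, here is a minimal sketch. It assumes the critiques amount to a constant multiplicative bias in the measured time horizons, which is my simplification, not something the article or METR states. The numbers are synthetic and made up for illustration, not METR data, and the function name is hypothetical.

```python
import math

def fit_log2_trend(years_since_start, horizons_minutes):
    """Ordinary least-squares fit of log2(horizon) = slope * year + intercept."""
    n = len(years_since_start)
    ys = [math.log2(h) for h in horizons_minutes]
    mean_x = sum(years_since_start) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in years_since_start)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(years_since_start, ys))
    slope = sxy / sxx                      # doublings per year
    intercept = mean_y - slope * mean_x    # log2(horizon) at year 0
    return slope, intercept

# Synthetic, invented time horizons (minutes) over 7 yearly points.
years = [0, 1, 2, 3, 4, 5, 6]
horizons = [0.5, 1.2, 3.9, 9.0, 30.0, 61.0, 250.0]

# Assumption: the benchmark overstates every horizon by the same factor of 4.
corrected = [h / 4 for h in horizons]

slope_raw, intercept_raw = fit_log2_trend(years, horizons)
slope_biased, intercept_biased = fit_log2_trend(years, corrected)

print("doubling time (months), raw:      ", 12 / slope_raw)
print("doubling time (months), corrected:", 12 / slope_biased)
print("intercept shift (log2 units):     ", intercept_raw - intercept_biased)
```

Under that assumption the fitted slope, and so the doubling time, is identical for both series; only the intercept moves, by log2(4) = 2. A bias that grows or shrinks over time would change the slope, which is the "curveball" case.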

You can look at ECI for a summary benchmark statistic, which does NOT use METR's benchmark, and you see a similar trend. Same with SWE-bench, where the task distribution is much closer to real-world problems. It is a bummer that this METR work can't be better funded. It would probably take $1M or so to really beef it up properly, which any of these labs probably has in its couch cushions.

Wow. This deserves to be much more widely read. Thank you for this.