

by aspenmartin 1 day ago

I appreciate the data here, but I don't think the read is quite right.

Saying we have linear capability for super-linear cost compares an unbounded variable (dollars) to bounded instruments (because benchmarks saturate). On unbounded measures, growth is exponential; you can see METR time horizons double every ~4-7 months (https://metr.org/blog/2026-1-29-time-horizon-1-1/). And capability being proportional to log(compute) is what the scaling law predicts.
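To put rough numbers on the two claims above, here is a minimal sketch. It converts a doubling time into a yearly growth factor, and shows what "capability proportional to log(compute)" means in practice: each 10x of compute buys the same fixed increment. The function names and the slope of 1.0 are illustrative assumptions, not values from METR or from any published scaling law.

```python
import math


def annual_multiplier(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 / doubling_months)


def log_capability(compute: float, slope: float = 1.0) -> float:
    """Toy model (assumed): capability rises by `slope` for every 10x of compute."""
    return slope * math.log10(compute)


# The ~4-7 month doubling range quoted above, expressed per year.
fast = annual_multiplier(4.0)  # 2 ** 3
slow = annual_multiplier(7.0)  # 2 ** (12 / 7)

# Under the toy log model, 10 -> 100 and 100 -> 1000 buy the same capability gain,
# even though the second step costs ten times as much compute.
gain_first = log_capability(100.0) - log_capability(10.0)
gain_second = log_capability(1000.0) - log_capability(100.0)

print(f"4-month doubling: {fast:.1f}x per year")
print(f"7-month doubling: {slow:.1f}x per year")
print(f"gain per 10x compute: {gain_first:.2f} then {gain_second:.2f}")
```

So a 4-7 month doubling time works out to somewhere between roughly 3x and 8x per year, and "linear capability for super-linear cost" is exactly what a log relationship looks like when plotted against dollars.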

Epoch puts training cost growth at ~2.4x/year as your link shows. Meanwhile cost for fixed capability falls ~10-40x/year (https://epoch.ai/data-insights/llm-inference-price-trends), and lab revenue is growing ~10x/year! Anthropic went from $1B to $9B to $30B+ run rate in ~15 months, OpenAI ~$25B.
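As a quick sanity check on the revenue figure, the run-rate numbers above can be annualized. This is only a sketch: it treats $1B and $30B as the endpoints and takes "~15 months" literally, both of which are approximations from the comment rather than audited figures, and `annualized_growth` is a hypothetical helper name.

```python
def annualized_growth(start: float, end: float, months: float) -> float:
    """Constant yearly growth factor that takes `start` to `end` in `months`."""
    return (end / start) ** (12.0 / months)


# Anthropic run rate as quoted above: ~$1B to ~$30B over ~15 months (assumed endpoints).
revenue_growth = annualized_growth(1.0, 30.0, 15.0)

# Epoch's training cost growth figure quoted above, for comparison.
training_cost_growth = 2.4

print(f"implied revenue growth: {revenue_growth:.1f}x per year")
print(f"revenue growth / training cost growth: {revenue_growth / training_cost_growth:.1f}")
```

On those assumptions the implied rate comes out around 15x per year, the same order of magnitude as the ~10x figure and well above the ~2.4x/year growth in training cost.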

On [3]: the "destroying value" conclusion flips sign on an assumed 15% baseline rework rate. The report's most direct metric is +16% merged PRs per dev. The RCT evidence is genuinely mixed (METR: -19%, with n = 20 and Claude 3.x; Cui et al: +26%), but it's just super hard to do this well. I think the Faros stuff was pretty cool; I hadn't seen it before, so thank you for the reference.

2 comments

oudlys 1 day ago

>"On unbounded measures, growth is exponential"

Maybe. There was a great comment in the thread on Fable 5 yesterday about benchmark comparisons between Fable and the latest Opus models. Here it is: https://news.ycombinator.com/item?id=48464600.

You could be right, but this is the most direct benchmark comparison I could find and it's not that strong.

>the "destroying value" conclusion flips sign on an assumed 15% baseline rework rate. The report's most direct metric is +16% merged PRs per dev.

I discuss this directly in my analysis. There's also an 860% increase in code churn. You only need 9% of that to be wasteful rework to drive throughput flat relative to the 15% rework baseline, not relative to an assumed ideal state where there was no rework.

But even if it were not true, a 16% throughput improvement is pretty weak given the investment - especially given the direct evidence of quality degradation. IMO.

I appreciate you reading my stuff and taking the data seriously. Thank you.


andrekandre 1 day ago

  > But even if it were not true, a 16% throughput improvement is pretty weak given the investment - especially given the direct evidence of quality degradation. IMO.

n=1 but at $JOB we have throughput quotas now, and what is happening is that teams are just finding lots of busywork (renaming things, gardening of ai .md files, rewriting uis etc) and also dividing prs into smaller chunks to match the quotas... so even a "throughput increase" doesn't say much if it's not for improving the customer outcome (ime anyways)


oudlys 16 hours ago

Productivity != value.

Thanks for the story.


balefulboy 1 day ago

METR's time horizon is not a reliable metric of LLM capability growth: https://www.transformernews.ai/p/against-the-metr-graph-codi...


aspenmartin 11 hours ago

Yes I've seen this before, and while the critiques are fair and high quality (and unfortunately not unique to METR) we're missing the forest for the trees here.

First of all, if you take the article's critiques and work out the implications on the METR graph, all you're doing is shifting the curve up or down; it doesn't change the fact that progress is scaling exponentially. While it is technically possible the universe could be throwing a massive pathological curveball that changes the conclusion from the METR data (which is that we've been seeing exponential growth over the last 6 years), I think that seems very far from likely. The fact that we see the same behavior from a variety of sources over a wide variety of tasks and domains is a pretty clear indication that METR, while certainly far from perfect, is actually painting a consistent picture, at least in terms of the rate of progress.
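The "shifting the curve up or down" point can be checked with a small sketch. On a log scale, multiplying every measured horizon by a constant only moves the intercept; the slope, and so the doubling time, is unchanged. The data below is synthetic and built to double every 6 months; it is not METR's data, and the 0.25 bias factor is an arbitrary assumption.

```python
import math


def doubling_time_months(months, horizons):
    """Least-squares slope of log2(horizon) against time, returned as months per doubling."""
    ys = [math.log2(h) for h in horizons]
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ys))
    sxx = sum((x - mean_x) ** 2 for x in months)
    return sxx / sxy  # the slope is doublings per month, so invert it


# Synthetic series (assumed): horizon doubles every 6 months by construction.
months = [0, 6, 12, 18, 24]
horizons = [2.0 ** (m / 6.0) for m in months]

# Suppose every measurement overstated the true horizon by 4x.
corrected = [0.25 * h for h in horizons]

base = doubling_time_months(months, horizons)
shifted = doubling_time_months(months, corrected)

print(f"original doubling time:  {base:.2f} months")
print(f"corrected doubling time: {shifted:.2f} months")
```

This holds only for a constant multiplicative error. An error that itself grows or shrinks with release date would change the slope, which is the pathological case described above.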

You can look at ECI for a summary benchmark statistic, which does NOT use METR's benchmark, and you see a similar trend. Same with SWE-bench, where the task distribution is far more in-domain for real-world problems. It is a bummer that this METR data can't be better funded. It would probably take $1M or so to really beef it up properly, which any of these labs probably has in its couch cushions.


oudlys 14 hours ago

Wow. This deserves to be much more widely read. Thank you for this.
