
by Bjorkbat 245 days ago

One of my biggest frustrations regarding the potential of an AI bubble was some very smart and intelligent researcher being incredibly bullish on AI on Twitter because, if you extrapolate graphs measuring AI's ability to complete long-duration tasks (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...) or other benchmarks, then by 2026 or 2027 you've basically invented AGI.

I'm going to take his statements at face value and assume that he really does have faith in his own predictions and isn't trying to fleece us.

My gripe with this statement is that the prediction is based on proxies for capability that aren't particularly reliable. To elaborate, the latest frontier models score something like 65% on SWE-bench, but I don't think they're as capable as a human who also scored 65%. That isn't to say that they're incapable, just that they aren't as capable as an equivalent human. I think there's a very real chance that a model absolutely crushes the SWE-bench benchmark but still isn't quite ready to function as an independent software engineering agent.

So a lot of this bullishness basically hinges on the idea that if you extrapolate some line on a graph into the future, then by next year or the year after all white-collar work can be automated. Terrifying as that is, this all hinges on the idea that these graphs, these benchmarks, are good proxies.

And if they aren't, oh wow.

4 comments

woeirua 244 days ago

There's a huge disconnect between what the benchmarks are showing and what those of us using LLMs experience day to day. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now. But practically speaking, I can't get them to reliably do even the most basic of tasks. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information that we have, and those seem to be quite mixed for the most recent models.


jzymbaluk 244 days ago

How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, but can't recognize it right in front of them.


woeirua 244 days ago

I'm sure they all know it's happening. But the incentives are all misaligned. They get promotions and raises for pushing the frontier, which means showing SOTA performance on benchmarks.


igleria 245 days ago

> very smart and intelligent researcher being incredibly bullish on AI on Twitter

A bit off-topic, but as time goes by, I believe we can be very intelligent in some aspects and very, very naive and/or wrong in others.


ludicrousdispla 245 days ago

>> by next year or the year after all white-collar work can be automated

Work generates work. If you remove the need for 50% of the work then a significant amount of the remaining work never needs to be done. It just doesn't appear.

The software that people use in their jobs will no longer be needed if those people aren't hired to do their jobs. There goes Slack, Teams, GitHub, Zoom, PowerPoint, Excel, whatever... And if the software isn't needed, then it doesn't need to be written, by either a person or an AI. So any need for AI coders shrinks considerably.


twothreeone 245 days ago

You mean Julian Schrittwieser (collaborator on AlphaGo and first author on MuZero)?

https://www.julian.ac/blog/2025/09/27/failing-to-understand-...
