| HN Mirror



	by NitpickLawyer 35 days ago
	IIRC that graph tracks capabilities as time_to_solve a task for humans (i.e. the model can now handle tasks that usually take a human ~8h). Which, depending on what tasks you look at, could be a reasonable finding. I could see Opus 4.6 handling tasks that take ~8h for humans, and that 5.1 couldn't previously handle (with 5.1 being "limited" at 4h tasks let's say). It is a bit arbitrary, but I think this is what they're tracking.

3 comments

jrumbut 35 days ago

Without knowing more about their methodology, it seems like a lot of the recent improvements have involved the AI itself taking time to complete the task.

At first the models turned a 5 minute task into a 5 second task (by 5 seconds I mean a very short amount of time, not precisely 5 seconds). Then they turned a 15 minute task into a 5 second task.

Opus 4.6 completes 8 hour tasks all the time but (at least in my experience) it isn't spitting the answer out in 5 seconds anymore. It's using chain of thought and tools and the time to completion is measured in minutes or maybe hours.

In my experiments with local LLMs, a substantial part of the gap between frontier and local (for everyday use) is in tooling and infrastructure.

That is why I am sympathetic to the idea we are leveling off. But to bring in the air speed example from the article, I don't think we've reached the equivalent of the ramjet yet. I suspect in the coming years there will be new architectures, new hardware, and new ways to get even more capable models.


Leynos 34 days ago

It measures the ability to complete (with a given success rate) a task with a known human benchmark time to complete. I.e., they set the task to human volunteers and timed how long they took to complete that task.
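To make the idea concrete, here is a minimal sketch of how a "time horizon" at a given success rate could be computed from such data. It assumes success probability follows a logistic curve in log2 of the human completion time, which is one plausible reading of the description above and not necessarily the benchmark's exact procedure. The function names, the fitting loop, and the run data are all hypothetical and made up for illustration.

```python
import math

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit p(success) = sigmoid(a + b*x) by plain gradient ascent
    on the mean log-likelihood. Illustrative only, not a tuned solver."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon(runs, success_rate=0.5):
    """Return the human task length (in minutes) at which the fitted
    curve predicts the model succeeds with probability `success_rate`.

    `runs` is a list of (human_minutes, model_succeeded) pairs.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    mean_x = sum(xs) / len(xs)
    # Center the inputs so the gradient steps are better conditioned.
    a, b = fit_logistic([x - mean_x for x in xs], ys)
    logit = math.log(success_rate / (1.0 - success_rate))
    # Solve a + b * (x - mean_x) = logit for x, then undo the log2.
    return 2.0 ** (mean_x + (logit - a) / b)

# Hypothetical data: human baseline time in minutes, and whether the
# model completed the task. These numbers are invented, not measured.
runs = [
    (1, True), (2, True), (4, True), (8, True), (16, False),
    (32, True), (64, False), (128, False), (256, False), (512, False),
]

print(f"50% horizon: {time_horizon(runs, 0.5):.1f} human-minutes")
print(f"80% horizon: {time_horizon(runs, 0.8):.1f} human-minutes")
```

Note that nothing in this calculation involves how long the model itself took. Only the human baseline time and the model's pass/fail outcome enter the fit, and a stricter success rate gives a shorter horizon.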


lukan 35 days ago

"It is a bit arbitrary, but I think this is what they're tracking."

I don't know if they can get their numbers right this way, but this seems a far more useful metric than theoretical capabilities.


cyanydeez 35 days ago

OK, but aren't you just measuring efficiency, and not improvements in the big I in AGI?


jsnell 34 days ago

No? I think you're misunderstanding what is being measured.

It is purely a test of capabilities (can it do a thing that takes a human $X hours), not efficiency (how fast will it do it).


Leynos 34 days ago

It also measures task coherence—ability to plan, form contingencies, recover from errors, mitigate accumulation of errors, and reconcile findings across a long context window.


lukan 35 days ago

Yes, but this study was not about that and "just efficiency" is actually what most people are after.

At least I want AI to solve my problems, not score high on an academic leaderboard.


MadxX79 35 days ago

I don't know why people are so impressed by 8h.

I trained an LLM to write the whole Harry Potter series, and that took JK Rowling like 17 years.

For my next point on the graph, I'll train the LLM to write the Bible, something that took humans >1500 years.


Smaug123 34 days ago

Have you used the models, out of interest? They routinely do things autonomously that are not in the training set that would take me 8h, and I wouldn't say I'm slow. The profile of tasks they can do this way is jagged, and maintaining architectural coherence ("months, not hours") is still beyond them, but they're perfectly capable of writing plans and sticking to them.


MadxX79 34 days ago

Yeah, I use them all the time. I just don't see any good argument that it's anything other than statistical pattern matching plus some sort of logic encoded in language. My overfitted LLM obviously didn't arrive at Harry Potter the same way JK Rowling did, so the amount of time she spent writing it is completely irrelevant to any discussion about whether or not the LLM should be able to reproduce it. For discussions of AGI it makes no difference whether it took her an hour or a decade to write it: the model has seen the result, so it can reproduce it.


Smaug123 32 days ago

I don't think you've addressed the fact that they can do long tasks that aren't in the training set? (And the fact that they're just statistical models isn't very relevant. So am I!)


Leynos 34 days ago

Look at the tasks in the benchmark (see §2 https://arxiv.org/html/2503.14499v3)


MadxX79 34 days ago

Yeah, what about them? As far as I can tell, the tasks are fixed. The AI companies should know the tasks by now and have overfitted their models on the tests, in the same way I'm implying I overfitted my model to reproduce Harry Potter.


Leynos 33 days ago

You can choose to believe that.
