| HN Mirror


by ozozozd 40 days ago

Not misunderstanding. And I had assumed what you described at first as well.

All I see now is celebration of how agents run for hours and handle “long-time horizons.”

Although the original definition is also flawed for coding. How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to the Fibonacci series for predicting software tasks? Because you can’t. It’s a made-up metric.


ben_w 40 days ago

> Not misunderstanding.

Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?

The wall-clock time the LLM spends per task isn't the metric. How long you can leave the LLM alone without intervention, in wall-clock time, isn't "long-time horizons"; it's more like "I gave it a list of tasks and it worked through them". Which is neat when it works, but different.

> All I see now is celebration of how agents run for hours and handle “long-time horizons.”

Yes? And? The "long time horizon" is defined with reference *to how long the task would take humans to do*. Of course this is celebrated. When I've experimented with them, quite often after finishing one task from the plan, they'll go right on to the next task. Each task may take minutes, but the plan can have hundreds of items in it, and hundreds of minute-by-the-clock tasks is indeed hours.

You're literally, in your opening sentence, complaining about 2 + 2 taking longer to solve. That isn't even close to the point of the "time horizons" metric.

> How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to the Fibonacci series for predicting software tasks? Because you can’t. It’s a made-up metric.

Mostly it wasn't estimated, but rather *measured*:

  2.2 Baselining

  In order to ground AI agent performance, we also measure the performance of multiple human “baseliners” on most tasks and recorded the duration of their attempts. In total, we use over 800 baselines totaling 2,529 hours, of which 558 baselines (286 successful) come from HCAST and RE-Bench, and 249 (236 successful) from the shorter SWAA tasks. 148 of the 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.

  Our baseliners are skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience, with software engineering baseliners having more experience than ML or cybersecurity baseliners. For more details about baselines, see Appendix C.1.

- https://arxiv.org/html/2503.14499v3
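
The figures in that passage can be cross-checked with simple arithmetic. This is only a sanity check on the quoted numbers, not anything taken from the paper's code; the variable names are illustrative.

```python
# Baseline counts as quoted from section 2.2 of the paper.
hcast_rebench = {"baselines": 558, "successful": 286}
swaa = {"baselines": 249, "successful": 236}
total_hours = 2529

total_baselines = hcast_rebench["baselines"] + swaa["baselines"]
total_successful = hcast_rebench["successful"] + swaa["successful"]

# Tasks without a measured human baseline fall back on researcher estimates.
tasks_total = 169
tasks_with_human_baseline = 148
tasks_estimated = tasks_total - tasks_with_human_baseline

print(total_baselines)                          # 807, consistent with "over 800 baselines"
print(total_successful)                         # 522
print(tasks_estimated)                          # 21, the tasks relying on researcher estimates
print(round(total_hours / total_baselines, 1))  # 3.1 hours per baseline attempt on average
```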

As with all the other metrics, this is now basically saturated, as nobody seems to want to pay METR $4M to hire a statistically significant number of engineers to spend 4h-1w on each of another 800 baselines for longer tasks. Or if they are, it's being kept very quiet.


ozozozd 39 days ago

Ok - someone should tell these people that agents running for hours isn’t a measure of success then.

Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.

I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.

Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."


ben_w 39 days ago

> Ok - someone should tell these people that agents running for hours isn’t a measure of success then.

To quote the researchers who coined the term:

  Does “time horizon” mean the length of time that current AI agents can act autonomously?

  No. The 50%-time horizon is the length of task (measured by how long it takes a human expert) that an AI agent can complete with 50% reliability. It’s a measure of the difficulty of a task, rather than the time an AI spends to complete the task.

- https://metr.org/time-horizons/
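
A minimal sketch of how a 50% time horizon could be read off such data. It assumes a logistic fit of success against log2 of the human completion time, which is a simplification and not METR's actual code. The function names (`fit_logistic`, `time_horizon`) and the toy data are invented for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(points, iterations=25):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by Newton's method.

    points: list of (human_minutes, succeeded) pairs, where succeeded is 0 or 1.
    """
    xs = [math.log2(t) for t, _ in points]
    ys = [y for _, y in points]
    a = b = 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p          # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w             # negated Hessian entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon(a, b, reliability=0.5):
    # Human task length (minutes) at which predicted success equals `reliability`.
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - a) / b)


# Toy data (invented): 10 attempts per task length; success falls as tasks get longer.
successes_out_of_10 = {1: 10, 2: 9, 4: 8, 8: 5, 16: 2, 32: 1, 64: 0}
points = [
    (minutes, 1 if i < wins else 0)
    for minutes, wins in successes_out_of_10.items()
    for i in range(10)
]

a, b = fit_logistic(points)
print(f"50% horizon: {time_horizon(a, b, 0.5):.1f} human-minutes")
print(f"80% horizon: {time_horizon(a, b, 0.8):.1f} human-minutes")
```

Nothing in this calculation involves how long the agent itself ran. The only time on the x-axis is the human's, so slowing the agent down changes nothing. The toy data are symmetric around the 8-minute tasks, so the fitted 50% horizon lands at 8 human-minutes, and the 80% horizon is shorter.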

If by "these people" you mean people who, like you, conflate "long time horizon" with "long wall-clock time", then yes, that's why I replied to you.

Conversely, when a researcher says "I can leave my LLM running for hours, because it has a long time horizon", this is *causality*. Car analogy: if time horizon is fuel efficiency, the LLM working by itself for hours at a time is like driving your car for thousands of miles. The latter can obviously be gamed by having a bigger fuel tank, but also comes automatically from having a more efficient engine. Max range != Engine efficiency, but more efficient engines increase range. "Long wall clock time without intervention" != "Long time horizon", but longer time horizons increase wall clock time without intervention.

In fact, another relevant quote from the researchers who coined the term:

  What does METR mean by a task? Would solving 1000 1-hour math problems in a row be a 1000-hour task?

  Our tasks are meant to be coherent, self-contained units of work that can’t be trivially split into independent pieces. Therefore, solving 1000 separate 1-hour math problems isn’t a 1000-hour task; we’d consider it a 1-hour task done 1000 times. The same idea applies for searching for needles in a 10-million-word haystack. In either case, you could easily split the work across many people working in parallel (or by making many parallel AI calls), so it’s not really a “long” task in the sense we care about.

  In contrast, the prototypical multi-hour task might look like iteratively debugging a complex system, where each fix reveals new problems that only make sense if you know what you already tried.

- https://metr.org/time-horizons/
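
The distinction in that answer can be sketched in a few lines, assuming (hypothetically) that a work plan is a list of independent tasks, each tagged with how long it would take a human. The names `Task`, `total_human_hours`, and `longest_coherent_task_hours` are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    human_hours: float  # how long a human expert would need for this one coherent task


def total_human_hours(plan):
    # Total work in the plan; grows with the number of items, however easy each one is.
    return sum(task.human_hours for task in plan)


def longest_coherent_task_hours(plan):
    # The quantity relevant to "time horizon": the hardest single unit of work.
    return max(task.human_hours for task in plan)


# 1000 separate 1-hour math problems, as in the METR FAQ example.
plan = [Task(f"math problem {i}", 1.0) for i in range(1000)]

print(total_human_hours(plan))            # 1000 hours of work in total
print(longest_coherent_task_hours(plan))  # still only a 1-hour task
```

An agent that works through this plan unattended accumulates a lot of wall-clock time, but the difficulty it has demonstrated is that of the longest single item.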

> Not sure how you’d measure software engineering tasks in an isolated manner like that. There are things I need to look up docs for, and others I don’t need to. And that depends on the person. There are tedious tasks that I sometimes get right with my first try, other times I have to look away for a minute and look back at it to get right. There is internet speed. Task evolves or architecture changes mid-task.

Are you unfamiliar with how statistics deal with such things? Even the quote I gave you in the previous comment had some of the humans failing to complete some of the tasks.

Also, to quote the researchers who coined the term:

  Our tasks are designed to be self-contained and well-specified, so that they’re fair to both the AI agents and the humans. In contrast, most real-world work draws on prior context, such as previous conversations, tacit knowledge, or familiarity with an existing code base. We think it’s better to think of our 2-hour tasks as what someone with low or no prior context (like a new hire or freelance contractor) could complete in 2 hours, rather than someone experienced who is already familiar with the project.

- https://metr.org/time-horizons/

> I wouldn’t consider anything well-defined and repetitively measurable a “long-time horizon task” - adding a new HTTP handler isn’t one, adding a new React route isn’t one.

First, "long" is a relative statement, not absolute. The early models could *only* reliably help with things that take a human a few seconds, e.g. stubbing out a function. Now they're up to 1.5 hours at P(success)=80%, or 11h59m at P(success)=50%. These are what "long time horizons" means in these cases: https://metr.org/time-horizons/

Second, the entire point of the METR study I linked you to is to put those tasks you are dismissive of on the same chart as frontier models and early models, in order to find out what kind of things each model can do. I suggest reading it or watching the video; both explain this point.

> Edit: Apparently there are people who care to be precise about this. See: https://subq.ai and how they describe it as "long‑context tasks."

Incorrect. "Long context" is a third thing: "long context" != "Long time horizon" != "Long wall clock time without intervention".

In the car analogy, where "time horizon" maps to fuel efficiency, wall clock time maps to range, and context maps to how good your field of view is from the driver's seat.
