by ozozozd
40 days ago
Not misunderstanding. And I had assumed what you described at first as well. All I see now is celebration of how agents run for hours and handle “long-time horizons.” Although the original definition is also flawed for coding. How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.
Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?
The wall-clock time the LLM spends per task isn't the metric. How long you can leave the LLM alone without intervention, in wall-clock time, isn't "long-time horizons" either; it's more like "I gave it a list of tasks and it worked through them". Which is neat when it works, but different.
> All I see now is celebration of how agents run for hours and handle “long-time horizons.”
Yes? And? The long time horizon is measured with reference *to how long the task would take humans to do*. Of course this is celebrated. When I've experimented with agents, quite often after finishing one task from the plan, they'll go right on to the next task. Each task may take only minutes, but the plan can have hundreds of items in it, and hundreds of minutes-long tasks do indeed add up to hours by the clock.
You're literally, in your opening sentence, complaining about 2 + 2 taking longer to solve; that isn't even close to the point of the "time horizons" metric.
> How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to fibonacci series for predicting software tasks? Because you can’t. It’s a made up metric.
Mostly it wasn't estimated, but rather *measured*:
- https://arxiv.org/html/2503.14499v3
As with all the other metrics, this is now basically saturated, as nobody seems to want to pay METR $4M to hire a statistically significant number of engineers to spend 4h-1w on each of another 800 baselines for longer tasks. Or if they are, it's being kept very quiet.
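To make "measured, not estimated" concrete, here is a rough sketch of how a 50% time horizon can be read off from measured human baselines. It assumes, roughly as the linked paper describes, a logistic fit of model success against the log of human completion time. The `runs` data, the `fit_logistic` helper and all numbers are invented for illustration; this is not METR's code or data.

```python
import math

# HYPOTHETICAL data, invented for illustration only:
# (measured human completion time in minutes, did the model succeed? 1/0)
runs = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 0), (30, 1),
    (60, 1), (60, 0), (120, 0), (240, 1), (240, 0), (480, 0), (960, 0),
]

def fit_logistic(xs, ys, steps=25):
    """Fit p(success) = sigmoid(a + b*x) with Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w               # entries of the 2x2 curvature matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Use log2 of human time as the predictor, since task lengths span
# several orders of magnitude.
xs = [math.log2(minutes) for minutes, _ in runs]
ys = [ok for _, ok in runs]
a, b = fit_logistic(xs, ys)

# p = 0.5 where a + b*x = 0, i.e. x = -a/b in log2(minutes)
horizon_minutes = 2 ** (-a / b)
print(f"50% time horizon: about {horizon_minutes:.0f} human-minutes")
```

Note that nothing in this calculation depends on how long the model itself took in wall-clock time. The only time axis is the human baseline, which is why slowing down tokens/sec wouldn't move the number.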