

by NiloCK 183 days ago

I appreciate horizon expansion as a fundamental metric, but duration seems like too crude a measure. We used to like it when computers were fast.

An infinitely unscrupulous model provider could double this five hour result by cutting your output tokens/second in half!

This isn't only a question of gaming the metric: the very strong current small-fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this - they will succeed or fail much faster just because they are much faster.

How about something like total output token count as the "long term horizon" metric instead?

3 comments

scellus 183 days ago

The time (horizon) here is not that of the model completing the task, but a human completing the task.


NiloCK 181 days ago

Wow that was a garbage comment!

My introduction to this type of model measurement came from an interview where the repeatedly hammered-home point was that Sonnet 4.0 nailed a gigantic refactor (conversion of a large legacy ASP.NET or similar into React server-side components or similar) in a loop whose runtime was some large number of hours. I mistakenly attributed the same framing here.


docstryder 183 days ago

Task duration is the time it would take for humans to complete the task. The speed of the models and how long they might take to complete the task is not part of this metric.
