| HN Mirror


by lnenad 2 days ago

> Total token consumption would also be another thing to consider as well, to rule out TPS.

There is a chart that compares this in the article.

> You should compare apples to apples. Weight them in a way that factors in total task completion time as the measure of "effort", not the arbitrary effort settings provided by the AI company. I don't care what the underlying effort level is, I care which model out of multiple, if running for the same amount of time, completes my task to a more accurate degree.

That's your opinion. My goal is exactly what this benchmark measures: an end result I can merge into the codebase, using the configuration provided by the lab. I don't run 50 agents in parallel, and I can use the $100 Anthropic plan just enough that I don't go over the limits.

Also, what is your specific argument against the benchmark findings, considering that some problems are solved by Opus and not by Codex? Whether one model uses more tokens is a completely different metric.