by jsnell 45 days ago
This doesn't seem to be controlling for the number of turns in any way. Am I missing something?

Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.

3 comments

They also don't mention what their sample size is, or anything about the distribution of input and response lengths.

It'd be more convincing if the author had actually plotted the distributions, so we could see whether the analysis holds water.

A plot of the input lengths using ggplot2's geom_density, with color and fill by model, 0.1 alpha, and an appropriate bandwidth adjustment, would let us see whether the input distributions look similar across the two models. The same plot for output lengths, faceted by input-length bin, would show whether those look the same too.

Edit: Or even a plot of the output length / input length ratio, faceted by input-length bin.
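
A rough, non-graphical version of that ratio-by-input-bin idea, as a minimal Python sketch. The record layout `(model, input_tokens, output_tokens)`, the function name, the bin edges, and the sample rows are all made up for illustration; none of it comes from the article's data.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median


def ratio_by_input_bin(records, bin_edges):
    """Summarize output/input length ratios per (model, input-length bin).

    records:   iterable of (model, input_tokens, output_tokens) tuples
               (hypothetical layout, one tuple per request).
    bin_edges: sorted list of input-length cut points.
    Returns {(model, bin_index): (count, median_ratio)}, where bin_index 0 is
    below the first edge and bin_index i covers
    bin_edges[i-1] <= input_tokens < bin_edges[i].
    """
    groups = defaultdict(list)
    for model, input_tokens, output_tokens in records:
        if input_tokens <= 0:
            continue  # ratio is undefined for empty inputs
        bin_index = bisect_right(bin_edges, input_tokens)
        groups[(model, bin_index)].append(output_tokens / input_tokens)
    return {key: (len(ratios), median(ratios)) for key, ratios in groups.items()}


# Made-up rows, purely to show the shape of the result.
sample = [
    ("model_a", 800, 200),
    ("model_a", 800, 400),
    ("model_a", 5000, 500),
    ("model_b", 500, 50),
    ("model_b", 20000, 1000),
]
summary = ratio_by_input_bin(sample, bin_edges=[1000, 10000])
for key in sorted(summary):
    print(key, summary[key])
```

Medians per bin are a much cruder view than the density plots, but they would at least show whether the two models were fed comparable inputs before anyone compares response lengths.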

OpenRouter may see you fire hundreds of requests at them, but they have no idea that "these 50 requests here at 4PM are for task A", "those 100 requests there do task B", etc. So it's a shallow analysis at the "overall request shape" level.
I think it should be tested on goals.

E.g. crack this puzzle, or fix this code so these tests pass. (A human can verify it doesn't cheese things.)
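
A minimal sketch of what goal-level accounting could look like, assuming every request were tagged with a task ID, which is exactly the information a router does not have. The tuple layout, the function name, and all numbers below are hypothetical.

```python
from collections import defaultdict


def per_task_totals(requests):
    """Aggregate per-turn requests into per-task totals.

    requests: iterable of (model, task_id, output_tokens), one tuple per turn
              (hypothetical layout; real logs would need a task identifier).
    Returns {(model, task_id): {"turns": n, "output_tokens": total}}.
    """
    totals = defaultdict(lambda: {"turns": 0, "output_tokens": 0})
    for model, task_id, output_tokens in requests:
        entry = totals[(model, task_id)]
        entry["turns"] += 1
        entry["output_tokens"] += output_tokens
    return dict(totals)


# Made-up example: model_b writes responses twice as long as model_a,
# but finishes the same task in 2 turns instead of 5.
requests = (
    [("model_a", "task_1", 300)] * 5
    + [("model_b", "task_1", 600)] * 2
)
totals = per_task_totals(requests)
for key in sorted(totals):
    print(key, totals[key])
```

With these invented numbers, model_b looks twice as verbose per response yet spends fewer output tokens on the whole task (1200 vs. 1500), which is the effect a per-request analysis cannot see.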