by languid-photic 48 days ago

Yes, the signal we are measuring is quite different from most evals.

We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?

Most evals are static or synthetic and, for code, generally stop at tests. Tests are a weak proxy for quality, since it's difficult to encode qualities like scope creep/churn, codebase fit, and maintainability in them. [1]

Almost every agent in a given run can pass tests at this point, but there is large separation during review.
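A rough sketch of that contrast, assuming each run records per agent whether its patch passed tests and whether it was picked in review. The agent names, record layout, and numbers here are made up for illustration; this is not our actual pipeline or data.

```python
from collections import defaultdict

# Hypothetical records: one inner list per spec/run,
# each entry is (agent, tests_passed, won_review).
runs = [
    [("agent_a", True, True), ("agent_b", True, False), ("agent_c", True, False)],
    [("agent_a", True, False), ("agent_b", True, True), ("agent_c", True, False)],
    [("agent_a", True, True), ("agent_b", True, False), ("agent_c", False, False)],
    [("agent_a", True, True), ("agent_b", True, False), ("agent_c", True, False)],
]


def rates(runs):
    """Return per-agent test pass rate and review win rate."""
    counts = defaultdict(lambda: {"runs": 0, "passed": 0, "won": 0})
    for run in runs:
        for agent, passed, won in run:
            c = counts[agent]
            c["runs"] += 1
            c["passed"] += int(passed)
            c["won"] += int(won)
    return {
        agent: {
            "test_pass_rate": c["passed"] / c["runs"],
            "review_win_rate": c["won"] / c["runs"],
        }
        for agent, c in counts.items()
    }


if __name__ == "__main__":
    for agent, r in sorted(rates(runs).items()):
        print(agent, r)
```

In this toy data the test pass rates sit close together while the review win rates spread out, which is the kind of separation described above.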

[1] https://voratiq.com/blog/your-workflow-is-the-eval

BugsJustFindMe 47 days ago

Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious, and I haven't seen any analysis exploring why that would happen.

languid-photic 47 days ago

My point is more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".

BugsJustFindMe 47 days ago

I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.

languid-photic 47 days ago

It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.

We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417
