
by languid-photic 48 days ago

We track performance vs. the all-in cost of completing real engineering tasks, rather than cost per token. [1]

Cost per token is a bit misleading because, as others have noted, different models use tokens in different ways. (Aside: this is also why TPS isn't a great metric.)
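To make the distinction concrete, here is a minimal sketch of all-in cost per task versus per-token price. The `Usage` type, the token counts, and the prices are all hypothetical and chosen only for illustration; they are not figures from the leaderboard.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token usage and pricing for one completed task (all values hypothetical)."""
    input_tokens: int
    output_tokens: int
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens


def task_cost(u: Usage) -> float:
    """All-in USD cost of completing one task."""
    total = (u.input_tokens * u.input_price_per_mtok
             + u.output_tokens * u.output_price_per_mtok)
    return total / 1_000_000


# Hypothetical: model_b is half the price per token but uses far more tokens.
model_a = Usage(input_tokens=200_000, output_tokens=20_000,
                input_price_per_mtok=2.0, output_price_per_mtok=10.0)
model_b = Usage(input_tokens=600_000, output_tokens=80_000,
                input_price_per_mtok=1.0, output_price_per_mtok=5.0)

cost_a = task_cost(model_a)
cost_b = task_cost(model_b)
```

With these made-up numbers, the model that is cheaper per token ends up more expensive per completed task, which is the effect described above.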

We found that 5.5 is about 1.5-2x more expensive overall. On a "Pareto" basis, we only find 5.5 xhigh worth it. At the lower reasoning levels, 5.4 still edges it out on cost/perf.
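For readers unfamiliar with the "Pareto" framing: a configuration is worth considering only if no other configuration is both at least as cheap and at least as good. A minimal sketch, assuming each configuration is summarized by one cost and one score; the names and numbers below are hypothetical placeholders, not leaderboard results.

```python
def pareto_frontier(points):
    """Return the names of configurations that no other configuration dominates.

    points: list of (name, cost, score) tuples. Lower cost is better,
    higher score is better.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            # Another point is no worse on both axes and strictly better on one.
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, cost per task in USD, review score) tuples.
configs = [
    ("model-a-high", 1.0, 0.70),
    ("model-b-high", 1.8, 0.68),
    ("model-b-xhigh", 2.5, 0.80),
]

frontier = pareto_frontier(configs)
```

In this toy data, `model-b-high` costs more and scores lower than `model-a-high`, so it drops off the frontier, while `model-b-xhigh` stays on because nothing cheaper matches its score.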

We take a spec-driven approach and mostly work in TS (on product development), so if you use a more steer-y approach, or work in a different domain, YMMV.

[1] https://voratiq.com/leaderboard?x=cost

4 comments

digdugdirk 48 days ago

Interesting! I've been thinking about how to create a similar type of evaluation system for myself. How do you handle tweaks to agentic tasks? Say a model gets pretty close to what you want, so you just need a quick follow-up prompt to the original response?

languid-photic 48 days ago

Yes! It depends on the extent of changes needed.

If the changes needed are small, I'll apply the best implementation as a foundation and then just iterate directly.

If the changes needed are drastic, it usually signals that something was wrong or ambiguous in the spec (or that the ensemble was too weak, which is rarely the case). In cases like this, I improve the spec and then rerun.

If it's in the middle, I'll usually apply the best and write a follow-on spec.

digdugdirk 48 days ago

How does that get integrated into the scoring system? I'm imagining a scenario where a cheaper model gets close and only needs a small follow-up to reach the desired result. How would this score in comparison to a larger model that got it right the first time, even if it may have been much more expensive overall?

languid-photic 48 days ago

We also use a secondary signal from blinded multi-verifier reviews. Each verifier ranks the candidates, and those verification outcomes serve as an additional quality signal. It's somewhat similar to consensus labeling.

Btw, this also helps manage scale. E.g., say you have 15 diffs to review: run a few verifiers to get a short list, then review directly and apply the best.
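One way to turn several verifier rankings into a short list is a Borda count. This is only an illustrative sketch: the comment above does not say how the rankings are actually aggregated, so the Borda scoring, the function name, and the diff labels here are all assumptions.

```python
from collections import defaultdict


def borda_shortlist(rankings, k):
    """Aggregate verifier rankings with a Borda count and return the top k.

    rankings: list of lists, each ordering candidate IDs from best to worst.
    Ties are broken alphabetically by candidate ID.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Best-ranked candidate gets n - 1 points, worst gets 0.
            scores[candidate] += n - 1 - position
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:k]


# Hypothetical rankings from three blinded verifiers over four diffs.
verifier_rankings = [
    ["d3", "d1", "d2", "d4"],
    ["d1", "d3", "d4", "d2"],
    ["d3", "d2", "d1", "d4"],
]

shortlist = borda_shortlist(verifier_rankings, k=2)
```

Here `d3` and `d1` make the short list, and a human would then review only those two diffs directly.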

BugsJustFindMe 48 days ago

It feels pretty weird that your ratings have:

gpt-5-4-high > gpt-5-4-xhigh

gpt-5-4-high > gpt-5-5-high

gpt-5-4 > gpt-5-5

gpt-5-2-high > gpt-5-2-xhigh

No other ratings I've seen show that.

languid-photic 48 days ago

Yes, the signal we are measuring is quite different from most evals.

We are measuring something much closer to: when multiple agents compete on the same spec, which one produces the patch that holds up best in code review?

Most evals are static or synthetic and, for code, generally stop at tests. Test evals are weak proxies for quality, since it's difficult to encode qualities like scope creep/churn, codebase fit, and maintainability in tests. [1]

Almost every agent in a given run can pass tests at this point, but there is large separation during review.

[1] https://voratiq.com/blog/your-workflow-is-the-eval

BugsJustFindMe 48 days ago

Ok, but my point is that the claims you make about more reasoning performing worse seem kinda suspicious, and I haven't seen any analysis exploring why that would happen.

languid-photic 47 days ago

My point is that more reasoning often leads to worse "scope creep/churn, codebase fit, maintainability".

BugsJustFindMe 47 days ago

I get it, but that is a significant claim. And the claim could be right, but it could also be wrong, and I see no analysis, not even a blog post on your website saying "wow, look at this weird thing we found". To me that makes the claim suspicious because it signals that nobody thought to investigate what's going on. Investigating weird results is how we demonstrate that what we're doing is right.

languid-photic 47 days ago

It’s mostly a bandwidth thing. We’ve seen the pattern consistently, but haven’t had time yet to write up the analysis carefully.

We are not the only ones to see the reasoning inversion: https://arxiv.org/abs/2510.11977, https://arxiv.org/abs/2502.08235, https://arxiv.org/abs/2507.14417

lukewarm707 48 days ago

would be interesting to see some other labs:

- deepseek v4 pro

- glm 5.1

- kimi k2.6

- qwen 3.6 max

- xiaomi 2.5 pro

- minimax 2.7

- grok

languid-photic 48 days ago

I agree!

So far we have been native harnessmaxxing, which simplifies things a lot.

The configuration space around open models is much larger. E.g., which models, capability heterogeneity, which harness, networking, data egress / privacy, etc.

If anyone is getting very good production code out of open models, I'd love to do a user interview to better understand your setup. Email is in my bio.

thepasch 48 days ago

With how much vendor harnesses are now actively steering the agent with their own instructions on top of user prompts, I think it’d be super interesting to see a comparison of one of the already tested models (so Opus 4.7 or GPT-5.5) across a range of harnesses other than their native one: OpenCode, Pi, Hermes, Kilo Code. The most popular coding-focused harnesses, basically.

languid-photic 48 days ago

Agreed. Harness is really important. Especially since many labs are now post-training agents directly in their native harness.

(Which is why my prior is that third party harnesses would not perform as well. But I haven't actually measured this.)

cyberpunk 48 days ago

OpenCode seems to give me better results than codex-cli. I’d be interested in seeing this too!

motbus3 48 days ago

But in what situation does it seem good to enable xhigh?