| HN Mirror


	by rTX5CMRXIfFG 24 days ago
	Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?

7 comments

bluegatty 24 days ago

You will immediately notice the difference if you use it at the threshold.

It's like watching a starting NBA player (not a superstar, just a starter) next to one who sits on the bench.

If you just watched them play, work out, and shoot, you'd never notice the difference.

Put them head to head and it's 98-54 and you start to see the patterns.

It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.

Software has innumerable kinds of problems at varying levels of complexity, so it provides the perfect testbed for seeing how far models can go in practice.

Should add: you're very right to hint that the harness, the tooling, models tuned to both the harness and the kinds of things people do on it, and some other things do make an enormous difference.

By and large, SOTA Codex/Claude Code are substantially better, at least for now. That may change.
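One possible reading of the "threshold" point, offered as a rough sketch rather than an established explanation: if a task needs many steps to all go right, a small per-step edge compounds. The per-step success rates below (0.99 and 0.95) are made-up illustrative numbers, the `task_success` helper is hypothetical, and independence between steps is an assumption.

```python
# Illustrative only: the success rates are invented, not measured from any model.
def task_success(per_step: float, steps: int) -> float:
    """Chance of finishing a task when all `steps` must succeed independently."""
    return per_step ** steps


for steps in (1, 10, 50):
    strong = task_success(0.99, steps)
    weak = task_success(0.95, steps)
    print(f"{steps:>3} steps: strong={strong:.3f} weak={weak:.3f}")
```

On a one-step task the two look nearly identical; at 50 steps the hypothetical stronger model finishes roughly 60% of the time and the weaker one under 8%.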

dnnddidiej 24 days ago

Head to head is interesting. I had not tried running 2 agents on the same task simultaneously with 2 different models.
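A minimal sketch of what such a head-to-head run could look like, assuming each agent can be wrapped as a plain function from a task string to an answer string. `head_to_head`, the stub agents, and the `passed` checker are all hypothetical names for illustration; no real model API is called here.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]          # hypothetical wrapper: task in, answer out
Checker = Callable[[str, str], bool]  # hypothetical: (task, answer) -> passed?


def head_to_head(tasks: List[str], agent_a: Agent, agent_b: Agent,
                 passed: Checker) -> Dict[str, int]:
    """Run both agents on every task and tally who passed what."""
    tally = {"both": 0, "a_only": 0, "b_only": 0, "neither": 0}
    for task in tasks:
        a_ok = passed(task, agent_a(task))
        b_ok = passed(task, agent_b(task))
        if a_ok and b_ok:
            tally["both"] += 1
        elif a_ok:
            tally["a_only"] += 1
        elif b_ok:
            tally["b_only"] += 1
        else:
            tally["neither"] += 1
    return tally


# Stub agents standing in for two models behind the same harness.
expected = {"2+2": "4", "3*3": "9", "10/4": "2.5"}
answers_b = {"2+2": "4", "3*3": "6", "10/4": "2"}
result = head_to_head(
    list(expected),
    agent_a=lambda task: expected[task],
    agent_b=lambda task: answers_b[task],
    passed=lambda task, answer: expected[task] == answer,
)
print(result)
```

The "a_only" and "b_only" buckets are the interesting ones: they show the tasks where the models actually diverge, which aggregate pass rates tend to hide.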

Sparkyte 24 days ago

No, you're not wrong. Many people will see what you see. Enthusiasts will see squeezing out that last drop of performance as monumental. I think it's okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.

Personal opinion: we need to focus more on efficiency than on how large or complex a model can get, since bigger models creep into higher resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.

raincole 24 days ago

By definition, the differences between "best models" are small. It's a tautology. If a model is significantly dumber than the others, then it's not one of the best models.

minimaxir 24 days ago

To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.

nl 24 days ago

The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.

I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved.

And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious.

mrothroc 23 days ago

I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.

The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.

For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.

The pipeline controls the quality far more than the model, empirically.
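For readers who have not seen a gated pipeline, here is a rough sketch of the shape described above, not the actual harness. The stage names come from the comment; `Stage`, `GateFailed`, `run_pipeline`, and the retry-on-failed-gate behaviour are assumptions made for illustration, and the stub stages do no real work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]    # hypothetical: previous artifact -> new artifact
    gate: Callable[[str], bool]  # hypothetical quality check on the new artifact


class GateFailed(Exception):
    """Raised when a stage cannot get its output past its gate."""


def run_pipeline(stages: List[Stage], request: str, max_attempts: int = 2) -> str:
    """Run stages in order; an artifact only moves on once its gate accepts it."""
    artifact = request
    for stage in stages:
        for _ in range(max_attempts):  # retrying is an assumption, not from the comment
            candidate = stage.run(artifact)
            if stage.gate(candidate):
                artifact = candidate
                break
        else:
            raise GateFailed(f"{stage.name} failed its gate after {max_attempts} attempts")
    return artifact


# Stub stages: each appends its name, and every gate accepts.
stages = [
    Stage(name, run=lambda a, n=name: f"{a} -> {n}", gate=lambda a: True)
    for name in ("plan", "design", "code", "build", "test")
]
final = run_pipeline(stages, "request")
print(final)
```

In a setup like this the model only appears inside each `run` callable, so swapping models leaves the gates, and therefore the acceptance bar, unchanged.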

Hfuffzehn 23 days ago

You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.

Because you have to adjust the harness to your problem space, and show that you have, before you can call it high-quality.

Many people will stop that discussion at the Claude Code vs. Codex vs. OpenCode level and then conflate it with model performance.

And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
