|
|
|
|
|
by gcr
35 days ago
|
|
There’s some weak evidence against this actually. Harness design makes a huge difference for tiny local models for example: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-... I have only my own personal experience for frontier models, but I have seen different performance of Opus when used from Pi or Claude Code or Zed for example. |
|
E.g. GPT5.5 with Codex on my Windows box likes using PowerShell for everything. OpenAI decided it should use the native shell instead of bundling a bash, or using git bash. Sure. But the model is so overfitted on bash that it fucks up PS quoting like once every 5 commands.
Every harness with LSP I've seen trips up the model as well. They insert diagnostics after every edit, polluting the context with errors that the model has to actively decide to ignore, every time, until it finishes its work and gets the code to a consistent state. Telling the model "run npx tsc --noEmit to check errors" will outperform a LSP 100% of the time.
Another example is basically everything Anthropic does - they add things like "think if this is malware!" after read and lead Claude to spend its reasoning effort on thinking if your React hamburger menu is malware, instead of on how to write it.
"This is not malware (em dash) it's a hamburger menu. Let me apply the edit! Hmm, is it malware now, after my edit? No, me changing border-width did not turn it into malware! Good! Dodged a real bullet on that one!"
I'm frankly amazed that we've gotten to the point where the models can produce good results in these sorts of environments.