| In addition, the harness around these models does a lot of work and changes the outcome significantly. I just had an issue where Claude CLI with Opus 4.7 High could not figure out why my Blazor Server program was inert: buttons didn't do anything, etc. After several rounds, I opened the web console and found that it failed to load blazor.js due to a 404 on that file. I copied the error message into Claude CLI, and after several more unproductive rounds I gave up. I then moved to Codex, with ChatGPT 5.5 High. I gave it the code base, problem description and error codes. Unlike Claude CLI, it spun up the project and used wget/curl to probe for blazor.js, and found that it was indeed not being served. It then did a lot more probing and some web searches, and after a while found that my project file was missing a setting. It added that and then probed again to verify it worked. So Codex fixed it in about 20 minutes without me laying hands on it (other than approving some program executions). However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and the tools available. For reference, this was me just trying to see how good the vibecoding experience is now, so I was trying to do this as hands-off as possible. |
My guess is that it is the fault of the model rather than the harness. I believe Opus is much worse than it was, for whatever reason, though I suppose it could be Claude Code's fault somehow. For the time being, though, Codex is much better, which I never thought I'd be saying.
I plan to run tests using Pi so that both models get the same system prompt and harness, but I suspect that it's only the subscription-level Claude Code that is worse, and we're not allowed to use that with Pi.