by fiso64 21 days ago
The fact that Claude and GPT 5.5 have nearly the same scores tells me your benchmark is not capturing a significant gap in capability between the two. What the linked page says about Claude is true in my experience: it frequently forgets important instructions and likes to take lazy shortcuts. GPT, by contrast, is much more attentive and takes its time when needed to deliver a complete and robust solution. I have tested both models on two private repos (C#, Go), on two long-horizon tasks with well-defined stop conditions, and observed the same pattern in both cases. Both models still require a large harness to reduce shortcuts and architecturally unclean code, but GPT performs much better, to the point where I find Claude unusable for any significant work.

GPT 5.5 does significantly outperform Opus 4.7 in the coding parts of our evals.

We also incorporate live decision-making in social games (where GPT 5.5 has actually regressed from earlier models, which shouldn't be a huge surprise if you've ever tried talking it out of some of its nits).

We are still looking for a way to integrate "logical" intelligence with social intelligence in a less arbitrary way, so I'd look at the ranking for the use case that applies to you (probably coding): https://gertlabs.com/rankings?mode=agentic_coding