Hacker News new | ask | show | jobs
by keldaris 77 days ago
As a scientist (computational physicist, so plenty of math, but also plenty of code, from Python PoCs to explicit SIMD and GPU code, mostly various subsets of C/C++), I can confirm - Codex is qualitatively better for my usecases than Claude. I keep retesting them (not on benchmarks, I simply use both in parallel for my work and see what happens) after every version update and ever since 5.2 Codex seems further and further ahead. The token limits are also far more generous (and it matters, I found it fairly easy to hit the 5h limit on max tier Claude), but mostly it's about quality - the probability that the model will give me something useful I can iterate on as opposed to discard immediately is much higher with Codex.

For the few times I've used both models side by side on more typical tasks (not so much web stuff, which I don't do much of, but more conventional Python scripts, CLI utilities in C, some OpenGL), they seem much more evenly matched. I haven't found a case where Claude would be markedly superior since Codex 5.2 came out, but I'm sure there are plenty. In my view, benchmarks are completely irrelevant at this point, just use models side by side on representative bits of your real work and stick with what works best for you. My software engineer friends often react with disbelief when I say I much prefer Codex, but in my experience it is not a close comparison.

3 comments

Have you tried the latest (3.1 pro) Gemini? In my experience, it's notably better for a similar type of problems than Opus 4.6. However, I don't really use OpenAI products to compare.
I actually haven't - I tried Gemini 3.0 Pro in Antigravity and was disappointed enough that I didn't pay much attention to the 3.1 release, it was notably worse than Opus and GPT at the time, and much more prone to "think" in circles or veer off into irrelevant tangents even with fairly precise instruction. I'll give 3.1 a try tomorrow, see what happens.
I've tried both against similar and haven't found it such a clear cut difference. I still find neither are able to fully implement a complex algorithm I worked on in the past correctly with the same inputs. Not sharing exactly the benchmark I'm using but think about something for improving performance of N^2 operations that are common in physics and you can probably guess the train of thought.
I've had reasonable success using GPT for both neighbor list and Barnes-Hut implementations (also quad/oct-trees more generally), both of which fit your description, haven't tried Ewald summation or PME / P3M. However, when I say "reasonable success", I don't mean "single shot this algo with a minimal prompt", only that the model can produce working and decently optimized implementations with fairly precise guidance from an experienced user (or a reference paper sometimes) much faster than I would write them by hand. I expect a good PME implementation from scratch would make for a pretty decent benchmark.
Think another level of complexity of algorithm, different expansion bases plus a mix of input sources. Also not trying to one-shot it.
I can roughly guess the train of thought and I am a bit surprised that Claude is failing you.

That said, I am puzzled at the algorithms that Claude & GPT "get" and ones that they do not.

(former physicist here. would love to know the kind of things you're working on. email on my profile)

>As a scientist (computational physicist,

Is there one that you prefer for, i dunno, physics?