|
|
|
|
|
by Eridrus
8 days ago
|
|
The problem is that there are a bunch of benchmarks, the model providers often don't even use the same benchmarks, a bunch of them have known problems, and it's expensive to do your own benchmarks. I am a GPT 5.x booster since to me it just feels smarter, and I generally felt like the benchmarks backed me up, but it's not every benchmark, so sadly we're mostly arguing about vibes. SWEBench-Pro was a big one, though apparently Claude was reading solutions out of the .git folder it wasn't meant to have access to among other problems. |
|
I’m currently working on two projects/clients one using Claude, one using Codex. I have a strong preference for the latter, but not because I think it is much more intelligent or writes much better code. It is simply because I find the way of interacting with it more pleasant: more literal, mechanical, makes fewer assumption and or double checks, and is less proactive in my experience. At least until some updates over the last few weeks.