|
Fascinating. Claude outperforms Codex in my coding setup by 5x to 10x; it's not even close. Interesting that you're claiming the opposite as a general fact here, if I take you literally at your word. Codex has been so bad, in fact, that while I was maintaining a $200/month subscription I literally did not even "use it up" while paying for it (after cancelling). But that was 3 months ago and I have not tried it since; they could have improved. To be fair, I think what you mean, if I drop the literal frame here, is this, tell me if I am right: "Codex > Claude in my setup." Is that right? Also to be fair, my tests were not apples to apples. I have sophisticated agent alignment harnesses which prevent Claude from hallucinating or going off the rails (not literally, not 100%: about 80% less hallucination, about 90% less drift, and about 98% more starting from crystal-clear intent). In my personal tests, Codex was not calibrated to use those systems; it had access to them but would have needed to find them. I am also in a massive project, next ai labs, ixcoach, with likely in the range of 20k files of code and 100x files of docs... It could just be my agent alignment harness that's making Claude outperform Codex. I'm looking into testing it on the major benchmarks and publishing the results. |
There is a fun term for this: “jagged frontier”.
It means one model can be much better than another at one thing and much worse at something else.