Hacker News new | ask | show | jobs
by neosat 16 days ago
Your argument is fine but different from the claim the OP is making. You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output. Subjectively, people might still prefer one over due to anything from design to marketing, but that's very different from the claim that X is better than Y for coding (see: "A colleague was convinced Claude is better"). Basically, I prefer Claude is a different claim than Claude is better and the latter has a higher bar of proof.
4 comments

> You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output.

You definitely can in principle; that’s the entire point of the comment you are responding to. If one tool completes it in 10 minutes with little hand holding, and the other does it in one hour at 4× the cost and while needing a lot of steering, the former is arguably better even if the end result is the same.

Whether that’s specifically true and demonstrable of GPT and Claude is another question, but your blanket statement doesn’t hold as a general rule.

That's a fair callout and I agree my statement was too general in just mentioning 'output', as you correctly pointed out. To define 'better' you would indeed need to agree on the dimensions you would evaluate candidates against.

I think a more appropriate rephrasing would be 'You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference on dimensions you care about'. In the case of latest of claude code vs codex with gpt 5.5) both are similar enough in the dimensions people will care about in evaluating (vs. differing wildly in cost or time taken).

This obviously correct take will get pushback, so let me add some other examples:

- which tool required more detailed goal-setting in the prompt?

- did one tool ask follow-up questions up front vs spread out over implementation?

- did either tool match existing coding styles?

- did either tool remind you about potential conflicts between what you asked it to build and other parts of the codebase?

There are a lot of ways to compare agents besides just the code. (Similarly, working engineers are not evaluated just on their code output.)

The colleague implicitly agreed that comparing the output was a valid way to settle the matter as they took part in the test, so they weren't using "better" in the way you propose.
I wasn’t really discussing the colleague, but either way, from:

> A colleague was convinced Claude is better so we played a game. We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

I don’t think it’s obvious that they specifically agreed that losing the game meant that. They might just have thought “sure, it might be fun”, if they even gave it that much thought.

“So we played a game” is rather vague and I feel it’s a bit of a leap to read it as: “as an explicit outcome of their claim that Claude is better, we made a formal bet as to whether they could tell the difference in the output, the failure of which would mean a full retractation of their statement”.

Claude and Codex are tools. You can't tell the difference in the output between something that was done with a ratcheting wrench vs a standard combination wrench, but your mechanic certainly knows the ratcheting wrench is better (for most tasks).

I've not used Codex to compare against, so I'm not claiming X is better than Y, but comparing tools simply on their output is naive.

" You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output"

Sorry I think this misses the mark.

Because it's not the output but the process.

And sometimes the outcomes are not always discernable.

Codex and Claude are very different.

I use them for different things.

Their behaviour difference is obvious.

Of course it'd impossible for anyone to tell by looking at my code base 'how it was written'.

You need to see the response in light of the original discussion. Referencing here for clarity since I should have included it in the first place: "We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code."

So the same person, was using similarly competitive tools, and showing that the output was hard to discern (indirectly the implication was also that implementation was fairly trivial in both of those). A better analogy would not be different process and widely different tools but for example two power drills. Sure, folks could still prefer one over the other, but that's a different claim that saying X is objectively better than Y when both are directly competing on very similar dimensions.

Assuming you meant Claude code: I'd love to learn more about "Codex and Claude are very different" because maybe I'm assuming just based on my use case where I use both of them interchangeably for the same thing (coding web and mobile apps)

It's not reasonable to compare results from two different tool sets, especially as they are guided by humans.

The only way a reasonable comparison could be made, would be to compare completely automated results from either technology - that would be useful.

For example - creating a 'per-baked script' and running on both to see the output.

Codex and Claude are obviously very different, though it's hard to characterize how those differences might apply exactly to a given problem.

Two 'very different power saws' will ultimately build the same home.

> A colleague was convinced Claude is better

That’s actually what my comment was based on; raw code output isn’t the only measure of quality. Engineers write better code if they have the tools they prefer.

The colleague participated in the test though, so apparently the colleague didn't object to "better" being interpreted as "makes better output".