by chromejs10 314 days ago
This should have been compared with Opus... I know OP says he didn't because of cost but if you're comparing who is better then you need to compare the best to the best... if Claude Opus 4.1 is significantly better than GPT 5 then that could offset the extra expense. Not saying it will... but forget cost if we want to compare solely the quality
6 comments

For what it's worth, I've been trying Opus 4.1 in VS Code through GitHub Copilot and it's been really bad. Maybe worse than Sonnet and GPT 4.1. I'm not sure why it was doing so poorly.

In one instance, I asked it to optimize a roughly 80 line C# method that matches some object positions by object ID and delta encodes their positions from the previous frame. It seemed to be confused about how all this should work and output completely wrong code. It has all the context it needs in the file and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
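For anyone unfamiliar with the kind of task described here, this is a rough sketch of matching objects by ID and delta-encoding their positions against the previous frame. It is in Python rather than C#, it is not the commenter's actual method, and every name in it (`delta_encode`, `delta_decode`, the frame layout as a dict of ID to `(x, y)`) is an assumption made for illustration.

```python
def delta_encode(prev_frame, curr_frame):
    """Encode curr_frame relative to prev_frame.

    Both frames are assumed to map object ID -> (x, y) integer position.
    Returns (deltas, added, removed):
      deltas  - ID -> (dx, dy) for objects present in both frames
      added   - ID -> (x, y) for objects new in curr_frame
      removed - list of IDs that were in prev_frame but are gone now
    """
    deltas = {}
    added = {}
    for obj_id, (x, y) in curr_frame.items():
        if obj_id in prev_frame:
            px, py = prev_frame[obj_id]
            deltas[obj_id] = (x - px, y - py)
        else:
            # No previous position to diff against, so send it in full.
            added[obj_id] = (x, y)
    removed = [obj_id for obj_id in prev_frame if obj_id not in curr_frame]
    return deltas, added, removed


def delta_decode(prev_frame, deltas, added, removed):
    """Rebuild the current frame from the previous frame and the encoded diff."""
    gone = set(removed)
    frame = {}
    for obj_id, (px, py) in prev_frame.items():
        if obj_id in gone:
            continue
        dx, dy = deltas.get(obj_id, (0, 0))
        frame[obj_id] = (px + dx, py + dy)
    frame.update(added)
    return frame


prev = {1: (0, 0), 2: (5, 5)}
curr = {1: (1, -1), 3: (9, 9)}
encoded = delta_encode(prev, curr)
restored = delta_decode(prev, *encoded)
```

The round trip (`restored == curr`) is the property worth checking whenever a model rewrites code like this.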

I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.

If you have a quick prompt you'd like me to try, I can share the results.

Use Claude Code, the rest aren't worth the bother.
What does Claude Code do differently to Copilot Agent? Shouldn't they produce the same(ish) result if they're using the same model?
If they prompt the same and ..., they should.

But they definitely don't, taking into account whatever prompts the tools are really using (or MS is using a neutered version to reduce cost). So I would agree with the suggestion. Using Sonnet through Copilot seems very, very different from Cursor or Cline or Claude Code.

Using the same exact model, Copilot consistently fails to finish tasks or makes a mess. It is consistent at this across IDEs (i.e. using the JetBrains plugin generates nearly identical bad results to VS Code Copilot). I then discard all it did and try the exact same (user) prompt in Cursor or Claude Code or Cline with the same model and it does the same task perfectly.

I've used both aider and opencode with both Opus and Sonnet. Opencode, at least initially, used Claude Code's exact prompt; and I found the results surprisingly different.

Perhaps it shouldn't be surprising; after all, we do want the LLMs to listen to the prompts and act differently. And the Claude team will presumably be tuning both Claude and Claude Code's prompts to each other to optimize their own experience, so it's perhaps not surprising that Claude + Claude Code's prompts work well together.

Copilot sucks more at applying what the model is instructing it to do
To me it seems that Opus is really good at writing code if you give it a spec. The other day I had GPT come up with a spec for a DnD text game that uses the GPT API. It one-shotted a 1k line program.

However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.

Wait, are you talking about Opus or GPT? Which GPT? You switched models mid-sentence.
GPT 4o came up with a design spec that I gave to Opus to implement.
Opus seems to need more babysitting IME, which is great if you are going to actually pair program. Terrible if you like leaving it to do its own thing or try to do multiple things at once.
I just want a model that feels like an extension of me. For example, if there's a task I can describe in one sentence - "add a REST API for user management in the DB, and make sure only users in the admin group are allowed to use it" - it would result in an API endpoint that's properly wired up to the right places, and the model does what I tell it, and nothing else, even if it would logically follow from what I told it.

And if it gets confused, needs clarification, or has its own initiative - I want it to stop and ask.

Oh, and it needs to be fast: its tokens per minute should be as fast as I can read what it generates (and I can read boilerplate-y code quite fast), and it shouldn't stop and think on every prompt, only when it needs to, and it should be much faster and more granular in backtracking.

The loop of waiting on the AI then having to fix and steer it constantly as it doggedly follows its own ideas has really taken the enjoyment out of vibe coding for me.

Have it break the problem into phases. Have it run unit tests after every phase. Only move forward after all the tests for the phase have passed. I’m using the free Qwen3-Coder, and with proper prompting it's fairly good.
That's insightful.

I spend a lot of time planning tasks, generating various documents per PR (requirements, questions, todo), having AI poke at my ideas (business/product/UX/code-wise), etc.

After 45 minutes of back and forth, we generally end up with a detailed plan.

This also has many benefits:
- writing tests becomes very simple (unit, integration, E2E)
- writing documentation becomes very simple
- writing meaningful PRs becomes very simple

It is quite boring though, not gonna lie. But that's a price I have accepted for quality.

Also, clarifying the ideas so much beforehand often leads me to come up with creative ideas later in the day, when I go for walks and mentally review what we've done and how.

You might want to try Claude Code if you haven't. It's perfect for exactly this plan, then build flow with a ton of documents. A colleague set up some strict code guidelines, right down to say, put constructors at the top, constants at the bottom, use this name for this, snake case for that. Code quality just shoots up with these details. Can't just hack away with a blunt axe.

People tend to hate Claude Code because it's not vibe coding anymore but it was never really meant to be.

Yes, I use Claude Code a lot, but I'm on the $20 tier so I've never seen Opus in action (I think it's Sonnet only?).
Opus costs 10X more. Maybe it's better, but I can't afford to use it, so who cares.
re: the comments that Opus is not cost effective...The whole sales pitch behind these tools, and quite specifically the pitch OpenAI made yesterday, is that they will replace people, specifically programmers. Opus is cheaper than a US-based engineer. It's totally reasonable to use it as the benchmark if it's best.

Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.

My experience with coding agents is they need a lot of hand-holding.

They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯

> My experience with coding agents is they need a lot of hand-holding.

So do engineers.

The difference is that IRL engineers know a lot about the context of the business, features, product, ux, stakeholders, expectations, etc, etc which means that the hand-holding is a long running process.

LLMs need all of these things to be clearly written down and specified in one shot.

Really depends on who's paying the bill, and how much gets done between interventions, right?

Inverting the problem, one might ask how best to spend (say) $5,000 monthly on coding agents. I don't know the answer to that.

> but forget cost if we want to compare solely the quality

I think this is the whole reason not to compare it to Opus...

I agree. Opus is cost prohibitive for most longer coding tasks. The increased output doesn't justify the cost.
You compare what can be used by most engineers. Most engineers are not going to spend that insane price of Opus. It's extremely high compared to all other models, so even if it is slightly better, it's a non-starter for engineering workloads.
> that insane price of Opus

I believe Opus starts at $20 a month, similar to GPT5 if you want more than just cursory usage.

Or am I missing something?

For $20/month you get Opus in-browser chat access, and Sonnet access in Claude Code.

If you want to use Opus in Claude Code, you've got to get the $100/month plan - or pay API prices. And agentic coding uses a lot of tokens.

Yes you are missing something:

    Claude Opus 4.1

    Most intelligent model for complex tasks
    Input  $15 / MTok
    Output $75 / MTok
    Prompt caching
      Write $18.75 / MTok
      Read  $1.50 / MTok
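To make those per-token prices concrete, here is a rough back-of-the-envelope calculator in Python. The four rates are the ones quoted above; the token counts in `example_session` are invented purely for illustration and are not measurements of any real session, and `session_cost` is a hypothetical helper, not part of any SDK.

```python
# Claude Opus 4.1 prices as quoted above, in USD per million tokens (MTok).
OPUS_4_1_PRICES = {
    "input": 15.00,
    "output": 75.00,
    "cache_write": 18.75,
    "cache_read": 1.50,
}


def session_cost(tokens, prices=OPUS_4_1_PRICES):
    """Return the USD cost for a dict of token counts keyed like `prices`."""
    return sum(prices[kind] * count / 1_000_000 for kind, count in tokens.items())


# Hypothetical agentic coding session: every number here is made up.
example_session = {
    "input": 200_000,
    "output": 50_000,
    "cache_write": 100_000,
    "cache_read": 2_000_000,
}

cost = session_cost(example_session)
print(f"Estimated cost: ${cost:.3f}")
```

Swapping in another provider's rate table gives a like-for-like comparison for the same token mix, as long as the categories line up.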
I see. Do you know of a resource that does an across-the-board apples-to-apples comparison between the different services (knowing they all price slightly differently)?

It would be useful to be able to easily compare what it costs across the big providers: Gemini, Grok, Claude, ChatGPT.

Most engineers spending their own money maybe, but the cost of Opus is not that much compared to the output when the company is paying for it.
gpt-5 isn't supposed to be the best, it's supposed to be cost effective
From OpenAI website:

> Our smartest, fastest, and most useful model yet

I'd say it's definitely supposed to be the best, it just doesn't deliver.

>> Our smartest, fastest, and most useful model yet

> I'd say it's definitely supposed to be the best, it just doesn't deliver.

What part of "Our" is difficult to understand in that statement? Or are you claiming that OpenAI owns another model that is clearly better than GPT-5?

Not the person you were responding to, but if a company provides a service, they want you to use it instead of their competitors'. No company is going to say “use ours unless you want to use the best, then use our competitor’s”. So even though I agree with you that they are not explicitly saying “this is the best model in the world”, they are definitely saying “hey this is the best we got, use it”.
Ah, so your reading of that statement is "our best, which we acknowledge is not THE best, but it's not supposed to be; it's supposed to be cost effective"?

I would suggest reading the entire comment thread before attacking people.

I was going by what sama was saying on Twitter. He was mostly hyping the cost-effectiveness of it, which can be considered a factor in what is "best" too.