by andrewchilds 117 days ago

Many people have reported Opus 4.6 is a step back from Opus 4.5 - that 4.6 is consuming 5-10x as many tokens as 4.5 to accomplish the same task: https://github.com/anthropics/claude-code/issues/23706

I haven't seen a response from the Anthropic team about it.

I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.

24 comments

wongarsu 117 days ago

Keep in mind that the people who experience issues will always be the loudest.

I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.

A recent claude code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style, hard to say

evilhackerdude 117 days ago

keep in mind that people who point out a regression and measure the actual #tok, which costs $money, aren't just "being loud": someone diffed session context usage and found 4.6 burning >7x the amount of context on a task that 4.5 did in under 2 MB.

svachalek 117 days ago

It's not that they don't have a point, it's that everyone who's finding 4.6 to be fine or great is not running out to the internet to talk about it.

marcus_cemes 117 days ago

Being a moderately frequent user of Opus and having spoken to people who use it actively at work for automation, I can say it's a really expensive model to run. I've heard of it burning through a company's weekend credit allocation before Saturday morning. I think using almost an order of magnitude more tokens is a valid consumer concern!

I have yet to hear anyone say "Opus is really good value for money, a real good economic choice for us". It seems that we're trying to retrofit every possible task with SOTA AI that is still severely lacking in solid reasoning, reliability/dependability, so we throw more money at the problem (cough Opus) in the hopes that it will surpass that barrier of trust.

SatvikBeri 117 days ago

I've also seen Opus 4.6 as a pure upgrade. In particular, it's noticeably better at debugging complex issues and navigating our internal/custom framework.

drcongo 117 days ago

Same here. 4.6 has been considerably more diligent for me.

AustinDev 117 days ago

Likewise, I feel like it's degraded in performance a bit over the last couple weeks but that's just vibes. They surely vary thinking tokens based on load on the backend, especially for subscription users.

When my subscription 4.6 is flagging I'll switch over to Corporate API version and run the same prompts and get a noticeably better solution. In the end it's hard to compare nondeterministic systems.

merlindru 117 days ago

That's very interesting!

Also, +1. Opus 4.6 is strictly better than 4.5 for me

perelin 117 days ago

Mirrors my experience as well. Especially the pro-activeness in tool calling sticks out. It goes web searching to augment knowledge gaps on its own way more often.

galaxyLogic 117 days ago

Do you need to upload your git for it to analyze it? Or are they reading it off GitHub?

gpm 117 days ago

They're probably running it with a claude code like tool and it has a local (to the tool, not to anthropic) copy of the git repo it can query using the cli.

MrCheeze 117 days ago

In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence, but are very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

bjt12345 117 days ago

I think that's because Opus 4.6 has more "initiative".

Opus 4.6 can be quite sassy at times, the other day I asked it if it were "buttering me up" and it candidly responded "Hey you asked me to help you write a report with that conclusion, not appraise it."

KronisLV 117 days ago

I got the Max subscription and have been using Opus 4.6 since, the model is way above pretty much everything else I've tried for dev work and while I'd love for Anthropic to let me (easily) work on making a hostable server-side solution for parallel tasks without having to go the API key route and not have to pay per token, I will say that the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.

alkhatib 117 days ago

Try https://conductor.build

I started using it last week and it’s been great. Uses git worktrees, experimental feature (spotlight) allows you to quickly check changes from different agents.

I hope the Claude app will add similar features soon

bredren 117 days ago

Can you explain what you mean by your parallel tasks limitation?

KronisLV 117 days ago

Instead of having my computer be the one running Claude Code and executing tasks, I might prefer to offload it to my other homelab servers to execute agents for me, working pretty much like traditional CI/CD, though with LLMs working on various tasks in Docker containers, each on either the same or different codebases, each having their own branches/worktrees, submitting pull/merge requests in a self-hosted Gitea/GitLab instance or whatever.

If I don't want to sit behind something like LiteLLM or OpenRouter, I can just use the Claude Agent SDK: https://platform.claude.com/docs/en/agent-sdk/overview

However, you're not supposed to really use it with your Claude Max subscription, but instead use an API key, where you pay per token (which doesn't seem nearly as affordable, compared to the Max plan, nobody would probably mind if I run it on homelab servers, but if I put it on work servers for a bit, technically I'd be in breach of the rules):

> Unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.

If you look at how similar integrations already work, they also reference using the API directly: https://code.claude.com/docs/en/gitlab-ci-cd#how-it-works

A simpler version is already in Claude Code and they have their own cloud thing, I'd just personally prefer more freedom to build my own: https://www.youtube.com/watch?v=zrcCS9oHjtI (though there is the possibility of using the regular Claude Code non-interactively: https://code.claude.com/docs/en/headless)

It just feels a tad more hacky than just copying an API key when you use the API directly, there is stuff like https://github.com/anthropics/claude-code/issues/21765 but also "claude setup-token" (which you probably don't want to use all that much, given the lifetime?)
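
For the non-interactive route mentioned above, here is a minimal sketch of how a homelab job might assemble the call. It assumes the `claude` CLI is installed and already authenticated on the runner, and it uses the `-p` (print) and `--output-format` options as I understand them from the linked headless docs. `headless_command` is a hypothetical helper, and the sketch only builds the command rather than running it.

```python
import shlex
from typing import List


def headless_command(prompt: str, output_format: str = "json") -> List[str]:
    """Build an argv list for a non-interactive Claude Code run.

    Hypothetical helper: it only assembles arguments, it does not
    authenticate or execute anything.
    """
    return ["claude", "-p", prompt, "--output-format", output_format]


cmd = headless_command("Summarize the failing tests in this repo")
# Print a shell-safe rendering of the command for logging or debugging.
print(shlex.join(cmd))
```

A CI-style runner could hand `cmd` to `subprocess.run(cmd, cwd=worktree_path)` inside each container; how the credentials get there is exactly the part that stays awkward.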

DaKevK 117 days ago

Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how much of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.

MrCheeze 117 days ago

Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...

Jach 117 days ago

I haven't kept up with the Claude plays stuff, did it ever actually beat the game? I was under the impression that the harness was artificially hampering it considering how comparatively more easily various versions of ChatGPT and Gemini had beat the game and even moved on to beating Pokemon Crystal.

MrCheeze 117 days ago

The Claude Plays Pokemon stream with a minimal harness is a far more significant test of model intelligence compared to the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt which more or less railroads the AI into not making the mistakes it wants to make). The latter two harnesses have become too easy for the latest generations of models, enough so that they're not really testing anything anymore.

Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles which are both the last puzzles in the game and by far the most difficult for AIs to do. Opus 4.5 made it there but was completely hopeless, 4.6 made it there and is showing some signs of maaaaaybe being able to eventually brute-force through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.

donovandikaio 104 days ago

Opus 4.6 has been a hit-and-miss for me. It does extremely well on very complex, long-running tasks but also struggles with very basic, seemingly straightforward work and often provides conflicting recommendations. For example, just this morning Opus 4.6 provided two options, recommended option 1, and at the end of the same message asked to start option 2; this does not happen in Opus 4.5.

For now, my workflow will be claude-opus-4-5 for everyday tasks and Opus 4.6 for more complex work.

data-ottawa 117 days ago

I think this depends on what reasoning level your Claude Code is set to.

Go to /model, select Opus, and the dim text at the bottom will tell you the reasoning level.

High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.

minimaxir 117 days ago

I set reasoning to Medium after hitting these issues and it did not make much of a difference. Most of the context window is still filled during the Explore tool phase (that supposedly uses Haiku swarms) which wouldn't be impacted by Opus reasoning.

ramon156 109 days ago

Lol, I went to change this setting only to realize it was already set to Medium

_zoltan_ 117 days ago

I'm using the 1M context 4.6 and it's great.

honeycrispy 117 days ago

Glad it's not just me. I got a surprise the other day when I was notified that I had burned up my monthly budget in just a few days on 4.6

Topfi 117 days ago

In my evals, I was able to rather reliably reproduce an increase in output token amount of roughly 15-45% compared to 4.5, but in large part this was limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less than optimal, either lacking crucial information (requiring a model to output an inference to accomplish the main request) or including a request for a less than optimal or incorrect approach to resolving a task (testing whether and how a prompt is evaluated by a model against pure task adherence). The clarifying questions many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities and something I do rate highly in models, as long as task adherence isn't affected overly negatively because of it.

In either case, there has been an increase between 4.1 and 4.5, as well as now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% for the same task was the maximum I saw. In general, for more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.
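
To put the percentages and the "5x" style claims on the same scale, here is a small sketch of the arithmetic. The token counts are made-up illustrations, not numbers from my evals, and both function names are hypothetical.

```python
def relative_increase(baseline_tokens: int, new_tokens: int) -> float:
    """Fractional increase over the baseline, e.g. 0.45 means +45%."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (new_tokens - baseline_tokens) / baseline_tokens


def multiplier(baseline_tokens: int, new_tokens: int) -> float:
    """Usage as a multiple of the baseline, e.g. 5.0 means '5x'."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return new_tokens / baseline_tokens


# Illustrative only: a +45% run versus a run that really is 5x.
print(relative_increase(1_000, 1_450), multiplier(1_000, 1_450))
print(relative_increase(1_000, 5_000), multiplier(1_000, 5_000))
```

So a 45% increase is a 1.45x multiplier, while a 5x report corresponds to a 400% increase, which is a very different order of effect.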

In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure token output numbers. However, with models being what they are and past, hard to reproduce regressions/output quality differences, that additionally only affected a specific subset of users, I cannot make a solid determination.

Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.

weinzierl 117 days ago

Today I asked Sonnet 4.5 a question and I got a banner at the bottom that I am using a legacy model and have to continue the conversation on another model. The model button had changed to be labeled "Legacy model". Yeah, I guess it wasn't legacy a sec ago.

(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)

etothet 117 days ago

I definitely noticed this on Opus 4.6. I moved back to 4.5 until I see (or hear about) an improvement.

hedora 117 days ago

I’ve noticed the opaque weekly quota meter goes up more slowly with 4.6, but it more frequently goes off and works for an hour+, with really high reported token counts.

Those suggest opposite things about anthropic’s profit margins.

I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.

ctoth 117 days ago

For me it's the ... unearned confidence that 4.5 absolutely did not have?

I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.

What this has given me also is a history of what was built and why it was built, because I have a list of prompts that were tasked to the subagents. With Opus 4.5 it would often leave the ... figuring out part? to the agents. In 4.6 it absolutely inserts what it thinks should happen/its idea of the bug/what it believes should be done into the prompt, which often screws up the subagent because it is simply wrong and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out, 4.6 assumes it knows and is wrong
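
Roughly, the file-based handoff looks like the sketch below. This is a simplified illustration and not my actual setup: the file naming, the `dispatch` helper, and the `Worker` signature are all assumptions, and `echo_worker` is a stand-in that involves no model at all.

```python
import tempfile
from pathlib import Path
from typing import Callable

# A worker receives a prompt file path and a report file path. How it runs
# an agent is left open; this signature is an assumption for the sketch.
Worker = Callable[[Path, Path], None]


def dispatch(task_id: str, prompt: str, workdir: Path, worker: Worker) -> str:
    """Write the prompt to disk, run the worker, and return its report text."""
    prompt_file = workdir / f"{task_id}.prompt.md"
    report_file = workdir / f"{task_id}.report.md"
    prompt_file.write_text(prompt, encoding="utf-8")
    worker(prompt_file, report_file)
    # The foreman reads only the report file, never the worker's raw output.
    return report_file.read_text(encoding="utf-8")


def echo_worker(prompt_file: Path, report_file: Path) -> None:
    """Stand-in worker that just acknowledges the prompt (no model involved)."""
    prompt = prompt_file.read_text(encoding="utf-8")
    report_file.write_text(f"DONE: {prompt}", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    report = dispatch("task-001", "Find why test_login fails.", workdir, echo_worker)
    # The prompt files left on disk double as a history of what was asked.
    history = sorted(p.name for p in workdir.glob("*.prompt.md"))
    print(report)
    print(history)
```

The point of the design is that the prompt files persist, so the list of prompts is the record of what was built and why.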

DaKevK 117 days ago

Have you tried framing the hypothesis as a question in the dispatch prompt rather than a statement? Something like -- possible cause: X, please verify before proceeding -- instead of stating it as fact. Might break the assumption inheritance without changing the overall structure.
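
A minimal sketch of what I mean, assuming the dispatch prompt is plain text assembled by the main agent or a wrapper. `build_dispatch_prompt` and the exact wording are hypothetical, not something from your protocol.

```python
from typing import Optional


def build_dispatch_prompt(task: str, hypothesis: Optional[str] = None) -> str:
    """Assemble a subagent prompt that marks any hypothesis as unverified."""
    lines = [f"Task: {task}"]
    if hypothesis:
        # Label the guess as a guess so the subagent checks it instead of
        # inheriting it as fact.
        lines.append(f"Possible cause (unverified): {hypothesis}")
        lines.append("Please verify this against the code before proceeding; it may be wrong.")
    return "\n".join(lines)


print(build_dispatch_prompt(
    "Fix the flaky login test",
    hypothesis="the session cookie expires too early",
))
```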

nwienert 117 days ago

After a month of obliterating work with 4.5, I spent about 5 days absolutely shocked at how dumb 4.6 felt, like not just a bit worse but 50% at best. Idk if it's the specific problems I work on but GP captured it well - 4.5 listened and explored better, 4.6 seems to assume (the wrong thing) constantly, I would be correcting it 3-4 times in a row sometimes. Rage quit a few times in the first day of using it, thank god I found out how to dial it back.

ctoth 117 days ago

Here's the part where you don't leave us all hanging? What did you figure out!!!

obmelvin 117 days ago

I believe they just mean setting the model back to 4.5

nerdsniper 117 days ago

In terms of performance, 4.6 seems better. I’m willing to pay the tokens for that. But if it does use tokens at a much faster rate, it makes sense to keep 4.5 around for more frugal users

I just wouldn’t call it a regression for my use case, I’m pretty happy with it.

baq 117 days ago

Sonnet 4.5 was not worth using at all for coding for a few months now, so not sure what we're comparing here. If Sonnet 4.6 is anywhere near the performance they claim, it's actually a viable alternative.

Snakes3727 117 days ago

Imo I found opus 4.6 to be a pretty big step back. Our usage has skyrocketed since 4.6 has come out and the workload has not really changed.

However I can honestly say Anthropic is pretty terrible about support, even for billing. My org has a large enterprise contract with Anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.

So odds of them addressing issues or responding to people feels low.

cjbarber 117 days ago

I wonder if it's actually from CC harness updates that make it much more inclined to use subagents, rather than from the model update.

j45 117 days ago

I have often noticed a difference too, and it's usually in lockstep with needing to adjust how I am prompting.

Put a different way, I have to keep developing my prompting / context / writing skills at all times, ahead of the curve, before they need to be adjusted.

cheema33 117 days ago

> Many people have reported Opus 4.6 is a step back from Opus 4.5.

Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.

OtomotO 117 days ago

Definitely my experience as well.

No better code, but way longer thinking and way more token usage.

DetroitThrow 117 days ago

I much prefer 4.6. It finds missed edge cases more often than 4.5. If I cared about token usage so much, I would use Sonnet or Haiku.

Foobar8568 117 days ago

It goes into plan mode and/or heavy multi-agent mode for any reason, and hundreds of thousands of tokens are used within a few minutes.

minimaxir 117 days ago

I've been tempted to add to my CLAUDE.md "Never use the Plan tool, you are a wild rebel who only YOLOs."

yakbarber 117 days ago

Opus 4.6 is so much better at building complex systems than 4.5 it's ridiculous.

grav 117 days ago

I fail to understand how two LLMs would be "consuming" a different amount of tokens given the same input? Does it refer to the number of output tokens? Or is it in the context of some "agentic loop" (eg Claude Code)?

lemonfever 117 days ago

Most LLMs output a whole bunch of tokens to help them reason through a problem, often called chain of thought, before giving the actual response. This has been shown to improve performance a lot but uses a lot of tokens
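
To make that concrete, here is a rough sketch of how the same input can cost different amounts depending on how much reasoning comes before the answer. All token counts and prices are made-up placeholders, not measurements or actual pricing, and it assumes reasoning tokens are billed like any other output tokens.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    input_tokens: int      # prompt plus context sent to the model
    thinking_tokens: int   # chain-of-thought emitted before the answer
    answer_tokens: int     # the visible response

    @property
    def output_tokens(self) -> int:
        # Assumption: reasoning tokens count toward output, same as the answer.
        return self.thinking_tokens + self.answer_tokens


def cost_usd(turn: Turn, in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Cost of one turn given hypothetical per-million-token prices."""
    total = (turn.input_tokens * in_price_per_mtok
             + turn.output_tokens * out_price_per_mtok)
    return total / 1_000_000


# Same input, different amounts of reasoning (all numbers are illustrative).
terse = Turn(input_tokens=10_000, thinking_tokens=500, answer_tokens=300)
verbose = Turn(input_tokens=10_000, thinking_tokens=6_000, answer_tokens=300)

IN_PRICE, OUT_PRICE = 5.0, 25.0  # hypothetical USD per million tokens
print(cost_usd(terse, IN_PRICE, OUT_PRICE))
print(cost_usd(verbose, IN_PRICE, OUT_PRICE))
```

In an agentic loop this compounds, because each tool call is another turn with its own reasoning and a growing context.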

zozbot234 117 days ago

Yup, they all need to do this in case you're asking them a really hard question like: "I really need to get my car washed, the car wash place is only 50 meters away, should I drive there or walk?"

jcims 117 days ago

One very specific and limited example, when asked to build something 4.6 seems to do more web searches in the domain to gather latest best practices for various components/features before planning/implementing.

andrewchilds 117 days ago

I've found that Opus 4.6 is happy to read a significant amount of the codebase in preparation to do something, whereas Opus 4.5 tends to be much more efficient and targeted about pulling in relevant context.

OtomotO 117 days ago

And way faster too!

Gracana 117 days ago

They're talking about output consuming from the pool of tokens allowed by the subscription plan.

bsamuels 117 days ago

thinking tokens, output tokens, etc. Being more clever about file reads/tool calling.

dakolli 117 days ago

I called this many times over the last few weeks on this website (and got downvoted every time), that the next generation of models would become more verbose, especially for agentic tool calling to offset the slot machine called CC's propensity to light the money on fire that's put into it.

At least in vegas they don't pour gasoline on the cash put into their slot machines.

reed1234 117 days ago

not in my experience

reed1234 117 days ago

"Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium."[1]

I doubt it is a conspiracy.

[1] https://www.anthropic.com/news/claude-opus-4-6

comboy 117 days ago

Yeah, I think the company that opens up a bit of the black box and open sources it, making it easy for people to customize it, will win many customers. People will already live within micro-ecosystems before other companies can follow.

Currently everybody is trying to use the same swiss army knife, but some use it for carving wood and some are trying to make some sushi. It seems obvious that it's gonna lead to disappointment for some.

Models are becoming a commodity and what they build around them seems to be the main part of the product. It needs some API.

reed1234 117 days ago

I agree that if there was more transparency it might have prevented the token spend concerns, which feel like they're caused by a lack of knowledge about how the models work.

PlatoIsADisease 117 days ago

Don't take this seriously, but here is what I imagined happened:

Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.

They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"