| HN Mirror


by EmanuelB 105 days ago

I can't notice any difference from 4.6 three weeks ago, except that this model burns way more tokens and produces much longer plans. To me it seems like this model is just the same as 4.6 but with a bigger token budget on all effort levels. I guess this is one way Anthropic plans to make their business profitable.

During the past weeks of lobotomized Opus, I tried a few different open weight models side by side with "Opus 4.6" on the same issue. The open weight models outperformed Opus 4.6, and did it way faster and cheaper. I tried the same problem against Opus 4.7 today and it did manage to find one additional edge case that is not critical, but should be logged. So based on my experience, the open weight models managed to solve the exact problem I needed fixed, while Opus 4.7 seems to think a bit more freely about the bigger picture. However Opus 4.7 also consumed way more tokens at a higher price, so the cost was 10-20x higher on Opus compared to the open weight models. I will use Opus for code review and minor final fixes, and let the open weight models do the heavy lifting from now on. I need a coding setup I can rely on, and clearly Anthropic is not reliable enough for that.

Why pay 200$ to randomly get rug-pulled with no warning, when I can pay 20$ for 90% of the intelligence with reliable and higher performance?

8 comments

elAhmo 105 days ago

It's funny to think that with a model release Anthropic can slide in some instructions ("be a bit more detailed" or something similar) that affect the token output by a few percent, 5-10%, which will not be noticeable by most users but over the course of the year would bring solid growth (once the VC craze is over, if ever) and increase income.

"Regular companies" would love to have growth like that without effectively doing anything.

weird-eye-issue 105 days ago

I like how some people are accusing them of reducing the overall token usage to screw over Claude Code users and then there are yet other people that are accusing them of deliberately increasing token usage to screw over API users (or maybe to get subscription users to upgrade, I'm not really sure)

doix 105 days ago

I suspect the real issue is that they just change stuff "randomly" and the experience gets worse/better cheaper/more expensive.

Since you have no way of knowing when they change stuff, you can't really know if they did change something or it's just bias.

I've experienced that so many times in the last month that I switched to codex. The worst part is, it could be entirely in my head. It's so hard to quantify these changes, and the effort it takes isn't worth it to me. I just go by "feeling".

wat10000 104 days ago

They don't even need to do anything. LLMs are effectively random anyway. Even ignoring temperature and inadvertent nondeterminism in inference, the change in outputs from a change in inputs is unpredictable and basically pseudorandom. That's not to say they aren't useful, just that Anthropic could make zero changes and people would still see variations that they'd attribute to malice.

1dom 105 days ago

The issue is business and transparency. Transparency is often in the customer's interest at the individual business's expense.

There are very, very few things that can be completely transparent without giving competitors an advantage. The nice solution to this is to be better and faster than your competitors, but sometimes it's easier just to remove transparency.

ethbr1 104 days ago

I expect "model transparency" to become the new "SSO" enterprise feature differentiator.

Enterprise use cases have to have it (or else pawn the YOLO off on their users), so it will be a key way to bucket customers into non-enterprise vs enterprise pricing.

edgolub 105 days ago

Nobody is accusing them of making the models more efficient.

People are complaining they are changing how many tokens you get on a subscription plan.

Why would anyone dislike getting more service for less (or the same) amount of money?

weird-eye-issue 105 days ago

> People are complaining they are changing how many tokens you get on a subscription plan.

They didn't change this. It's the same number of tokens just a different tokenizer.

esperent 105 days ago

They absolutely do change this all the time - session limits vary wildly. The most damning proof of this is that there's absolutely no information about how many tokens you get per session with each subscription level, it's just terms like 5x, 20x. But 5x what? Who knows?

weird-eye-issue 105 days ago

That's not proof of anything. Also the usage is not solely based on tokens because you also have to factor in things like prompt caching costs (and savings). So it's based on the actual API cost.

verve_rat 105 days ago

You and I have no way of knowing that.

weird-eye-issue 105 days ago

Except that the API cost is literally logged on disk for every session and it's easy to analyze those logs.

EmanuelB 105 days ago

I think this is the case. In the early GPT-4 days I tested the same model side by side across the subscription and API. The API always produced a longer better answer. To me it felt like the API model was working how it was supposed to work while the subscription model tried to reduce its token usage. From a business perspective that would make sense. I then switched to API only because I felt like it was worth the extra cost.

I did a similar test with Sonnet about 6 months ago and noticed no difference, except that the subscription was way cheaper than API access. This is not the case anymore, at least not for me. The subscription these days only lasts for a few requests before it hits the usage limit and goes over to "extra usage" billing. Last week I burned through my entire subscription budget and 80$ worth of extra usage in about 1h. That is not sustainable for me and is the reason I started looking at alternatives.

From a business perspective it all makes sense. Anthropic recently gave away a ton of extra usage for free. Now people have balance on their accounts that Anthropic needs to pay for with compute, and suddenly they release a model that seems to burn those tokens faster than ever. Last week I felt like the model did the opposite: it was stopping mid-implementation and forgetting things after only 2 turns. Based on the responses I got, it seemed like they were running out of compute, so they lobotomized their model and made it think less, give shorter answers etc. They are probably also doing A/B testing on every change, so my experience might be wildly different from someone else's.

weitendorf 105 days ago

The UIs all bake in system prompts and other tunable configs that the API leaves open, so does Claude Code and other harnesses. So anything you notice different over the API when you're controlling the client is almost certainly that. Note that this is kind of something they have to do because consumer UI users will do stuff like ask models their name or date, or want it to respond politely and compassionately, and get upset/confused when they just get what's in the weights.

The problem with subscriptions for this kind of stuff is that it's just incompatible with their cost structure. The worst being, subscription usage is going to follow a diurnal usage pattern that overlaps with business/API users, so they're going to have to be offloaded to compute partners who most likely charge by the resource-second. And also, it's a competitive market, anybody who wants usage-based pricing can just get that.

So you basically end up with adverse selection with consumer subscription models. It's just kind of an incoherent business model that only works when your value proposition is more than just compute (which has a usage-based, pretty fungible market)

weird-eye-issue 105 days ago

> In the early GPT-4 days I tested the same model side by side across the subscription and API. The API always produced a longer better answer.

If you are comparing responses in ChatGPT to the API, it's apples and oranges, since one applies a very opinionated system prompt and the other does not.

Since you haven't figured that out in 3 years, I didn't bother reading the rest of your comment.

Natfan 105 days ago

this comment feels pretty rude and disrespectful for no real reason?

mattdmrs 104 days ago

I don’t know about ChatGPT, but in Claude Code I _have_ been able to do a side-by-side comparison of API-based metered billing vs subscription billing, in the same UI. You just switch from one to the other using /login.

You should probably not be so quick to dismiss what people say as nonsense.

rrr_oh_man 105 days ago

It's almost as if there are different people with different motivations and ideas about how the world should work

HarHarVeryFunny 104 days ago

They have switched tokenizer to one that generates 1-1.35x (i.e. up to 35% more) tokens for the same input.

They have changed default CC effort to xhigh.

They have said that Opus 4.7 will generate more tokens than 4.6 at same effort level.

They have increased their image input resolution meaning more tokens per image.

etc.

Maybe they are also extracting another 5% tokens from you by prompting it to not talk like a caveman, but that would hardly be noticeable.
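To put rough numbers on how factors like these could stack up, here is a back-of-the-envelope sketch. It assumes the factors multiply independently, which may not hold in practice. The 1.35 tokenizer ceiling and the 5% prompt nudge are the figures mentioned above; the 1.20 verbosity factor is a made-up placeholder, since nobody in the thread gave a number for it.

```python
def combined_token_multiplier(*factors: float) -> float:
    """Multiply independent per-cause token multipliers into one overall factor."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


tokenizer_worst_case = 1.35    # upper end of the 1-1.35x range quoted above
prompt_nudge = 1.05            # the "another 5%" mentioned above
hypothetical_verbosity = 1.20  # ASSUMPTION: illustrative only, not a published figure

tokenizer_and_nudge = combined_token_multiplier(tokenizer_worst_case, prompt_nudge)
all_three = combined_token_multiplier(
    tokenizer_worst_case, prompt_nudge, hypothetical_verbosity
)

print(f"tokenizer + nudge: {tokenizer_and_nudge:.2f}x")
print(f"all three factors: {all_three:.2f}x")
```

The point is only that several individually small multipliers compound, so a 5% nudge is hard to spot next to a tokenizer change.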

tomjakubowski 104 days ago

Like a reverse speed-up loop https://thedailywtf.com/articles/The-Speedup-Loop

paulluuk 105 days ago

If open weight models are sufficient for your engineering problems, then you should absolutely use them. But I haven't seen a single open weight model that can get even close to the complexity in my projects. They sometimes work for small toy examples or leetcode puzzles, but not for any real project. Really curious what models you've found that could replace current state of the art.

berkes 105 days ago

I've been using devstral2 with great success for a few months now. The hosted version, not running one locally or such. Devstral is open.

Devstral is good, Opus better. But not much. For me, "good" is "good enough". The difference, IME, lies in context engineering: skills, agents.md, subagents, tools, prompts. A Devstral with good skills performs far better than a "blank" Claude Code. Claude with good skills performs even better, but hardly noticeably, IME.

I am convinced I've plateaued. Better performance comes from improving skills and other "memory", prompting smarter, better context management and, above all, from the tooling around it and the stability of the services.

I do still run Claude with Opus alongside Mistral with Devstral2. Sometimes to just compare outputs, often to doublecheck, but mostly to doublecheck my statement that the difference between Devstral2 and Opus is marginal and easily covered by better context engineering.

port11 104 days ago

Perhaps. I’d like to like Devstral because I’d rather give my money to a European business.

My experience with it in an existing codebase has been that it gets to results much more reliably than Gemini Flash or Haiku, but it will cut corners and write incomprehensible code even with a good Opus plan to boot.

It’s true that the context and tooling might help, but setting everything up and finding the arcane mix of correct MCPs/skills is a job in itself right now. What I do see is that I’ve wasted months trying to get good code out of Gemini, Devstral2, and a good experience out of stuff like OpenCode and everything under the sun.

berkes 104 days ago

> is a job in itself right now.

Yes, exactly. I consider this the core of my job now: herding agents.

It reminds me very much of the time that I "herded" juniors, interns and new hires.

And my experience is that OpenCode et al. don't do a "Good Enough" job. It's better than e.g. Devstral2, but without guidance, still not sufficient. I think that mostly has to do with a combination of my experience and standards and of my languages and niches.

All of them are good enough for throwing out a React spaghetti, one you'd expect from Fiverr or from an intern: don't look under the hood, just drive it (launch it and leave it). Claude is far better in such a "benchmark" than e.g. Devstral2.

But when I need a hexagonal-architectured, TDD and BDD covered microservice in Python with zero type warnings, all models fail spectacularly out of the box. I presume their training body isn't "used" to such patterns: it's statistically unlikely to ignore type warnings in Python (wink). Just like it's statistically unlikely to write a few files of TypeScript for a feature, instead of pulling in a node package. Turns out, esp. with Claude Code, it's statistically likely to comment out failing tests if the rule is "ensure all tests pass" and this one is hard to fix¹.

So to get this level of what we require, I need tons of rules, guidelines, skills and whatnot. On every model. So I'll just as well - indeed - pipe my money into an EU company that's cheaper and has the option of self-hosting when s* starts hitting fans.

--- ¹ I think I finally found the "context" to fix this, though. What I used to tell my interns/juniors is to take a step back and re-think the shape of things: a difficult or complex test usually means the code it is testing needs re-architecting. Something most agents will refuse, and good, because it would side-track them. My solution is to tell agents to stop, document the problem, and if obvious, document the solution as well in a dedicated "technical debt" markdown file. Then in future I'll direct another agent at this file and tell it to start fixing them one at a time.
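For anyone wanting to try the "technical debt" file idea, a minimal sketch of a helper an agent could be told to call instead of commenting out a test. The file name `TECH_DEBT.md`, the entry layout, and the function `log_tech_debt` are all assumptions made for illustration; the comment above does not describe a specific format.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def log_tech_debt(debt_file: Path, problem: str,
                  solution: Optional[str] = None,
                  today: Optional[date] = None) -> str:
    """Append one entry to the debt file and return the text that was written."""
    stamp = (today or date.today()).isoformat()
    lines = [f"## {stamp}: {problem}", ""]
    if solution:
        # Only recorded when the fix is obvious, as described above.
        lines += [f"Proposed fix: {solution}", ""]
    lines += ["Status: open", ""]
    entry = "\n".join(lines) + "\n"
    with debt_file.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


# Demo in a throwaway directory; the problems and dates are invented examples.
with tempfile.TemporaryDirectory() as tmp:
    debt_file = Path(tmp) / "TECH_DEBT.md"  # hypothetical file name
    log_tech_debt(debt_file, "test_checkout needs six mocks",
                  solution="split CheckoutService", today=date(2025, 1, 1))
    log_tech_debt(debt_file, "flaky retry test", today=date(2025, 1, 2))
    debt_text = debt_file.read_text(encoding="utf-8")
    print(debt_text)
```

A later agent run can then be pointed at the file and asked to pick the first entry still marked "Status: open".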

port11 102 days ago

I agree with all you’ve said.

Gemini loves deleting tests as well, and all of them will relentlessly stub things to make unit tests ‘easy’.

What experience brought me is knowing where to steer them, e.g. scrapping all their shitty glue code and hand-holding Sonnet into implementing classes, DI, and unit tests that aren’t brittle at all. In that way, the agents have been nice to work with: they remind us of why cleaner code and good practices make for maintainable code. I hate their React spaghetti, but most places I’ve worked had tons of React spaghetti anyway…

All of this said: I actually miss steering juniors instead. Humans are frustrating to work with, but they are also adaptable, grow with time, and are… you know, human.

Mentoring Claude isn’t exactly fun or rewarding, in the way mentoring a colleague would be. And thankfully we have memory MCP servers, otherwise it would be like mentoring a brand new intern every time you fire up Claude.

berkes 105 days ago

Someone just asked me what I dislike most about Mistral and about Claude Code.

I run both in the Zed editor. Claude Code's integration is subpar: its ACP does not report tasks, doesn't give diffs and so on.

Mistral has rate limits that I hit just too often, and the agent then stops with an error. I'm on Mistral Pro now, where this is worse; pay-as-you-go is better but costs me 10x the Pro price.

weitendorf 105 days ago

I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model.

Most of the value in agentic development IMO is in the feedback loop/ability for the model itself to intelligently pull in context, but if you want to push a lot of context or have steps that are more prescribed, it's kind of a waste of money to have the big model do that. Much better to use the cheap model as a kind of pre-processing/noise-reduction step that filters out junk context.

I would say that right now the benefits are largest for this kind of work with medium-sized multimodal models. For example I have hooks/automation that use https://github.com/accretional/chromerpc to automatically screenshot UIs and then feed it into qwen-family models. It's more that I don't want to pay Opus to look at them or remember/be instructed to do that unless it goes through QA first.
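A rough sketch of the shape of that setup: a cheap filter decides which context chunks are worth forwarding, and only the survivors reach the expensive model. Both models are stand-in callables here (a keyword check and an echo function), and the names `prefilter_context` and `run_pipeline` are invented for illustration; a real setup would replace the callables with actual model calls.

```python
from typing import Callable, Iterable, List


def prefilter_context(chunks: Iterable[str],
                      is_relevant: Callable[[str], bool]) -> List[str]:
    """Keep only the chunks the cheap filter judges relevant, preserving order."""
    return [chunk for chunk in chunks if is_relevant(chunk)]


def run_pipeline(task: str,
                 chunks: Iterable[str],
                 cheap_filter: Callable[[str], bool],
                 expensive_model: Callable[[str], str]) -> str:
    """Filter context with the cheap step, then call the expensive model once."""
    kept = prefilter_context(chunks, cheap_filter)
    prompt = task + "\n\n" + "\n---\n".join(kept)
    return expensive_model(prompt)


# Stand-ins: a keyword check plays the cheap model, an echo plays the big one.
chunks = ["def login(user): ...", "LICENSE text", "def logout(user): ..."]
keyword_filter = lambda chunk: "def " in chunk
echo_model = lambda prompt: prompt

kept_chunks = prefilter_context(chunks, keyword_filter)
final_prompt = run_pipeline("Review the auth code.", chunks, keyword_filter, echo_model)
print(final_prompt)
```

The big model is billed only for the task plus whatever the filter kept, which is where the savings would come from.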

embedding-shape 105 days ago

> I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model.

Yes, in theory, this should hold up, at least according to evaluations.

In real, practical use though, none of the open weight models are generally strong enough to handle coding and programming in a professional environment, unless you have tightly controlled scope and specialized models for those scopes, which generally I don't think you have, but maybe it's just me jumping around a lot.

Even with feedback loops, harnesses and what not, even the strongest local models I can run with 96GB of VRAM don't seem to come close to what OpenAI offered in the last year or so. I'm sure it'll be ready at one point, but today it isn't.

With that said, if you know specific models you think work well as general, local programming models, please share which ones, happy to be shown wrong. The latest I've tried was Qwen3.6-35B-A3B, which gets a bit further, but its instruction following is still a far cry from what OpenAI et al offered for years.

fragmede 105 days ago

qwen3.6-35b-a3b, released today.

https://qwen.ai/blog?id=qwen3.6-35b-a3b

https://news.ycombinator.com/item?id=47792764

otabdeveloper4 105 days ago

Fundamentally they're the same technology with the same exact algorithms under the hood; only the post-training alignment differs.

That is, the difference you see is either placebo effect or you being lucky and better aligning with model post-training bias.

paulluuk 105 days ago

Sorry, I was not specific enough. I did not mean that open source itself is not enough, I meant that an open source model that can actually run locally on my machine is not enough. A 32B model cannot compete with a 250B+ state of the art model, at least that's my experience and seems to be the experience of many others as well.

eloisant 105 days ago

Yes they're not as powerful, that means you need to feed them smaller tasks and rely more on plan mode.

otabdeveloper4 104 days ago

Saying it "cannot compete" is like saying that a Kia cannot compete with a BMW.

Technically true in some sense, but fundamentally the two are the same exact thing and it's highly unlikely you have a task that actually requires a BMW.

brunooliv 105 days ago

Also my experience

parasti 105 days ago

Which open weights model?

InvisGhost 105 days ago

It goes to a different school, you wouldn't know it

Scrounger 105 days ago

> Which open weights model?

Yes, I'm also wondering!

Currently I'm testing out gemma4:26b and qwen3.6:35b-a3b-q4_K_M locally on my M2 Max Macbook Pro.

Not the fastest, but reasonable.

However, I am also interested in getting as close as possible in performance to Opus 4.6 while minimizing my costs.

hk__2 105 days ago

> I am also interested in getting as close as possible in performance to Opus 4.6 while minimizing my costs.

Aren’t we all? ;)

itsdavesanders 104 days ago

Remember, open weight doesn't necessarily mean local. They are probably running a larger version online, closer to Claude specs. (lol and probably distilled from Claude)

taffydavid 105 days ago

Gemma4 on an m2? That sounds promising. I have an m3 max, going to try that today

misja111 105 days ago

I'm actually seeing a similar thing when comparing 4.6 and 4.5. It burns a lot more tokens, does show more how it is thinking along the way, but I don't see a strong difference in the end result. Occasionally 4.6 even seems to get stuck in its 'processing' phase, while 4.5 doesn't on the same task.

spaceman_2020 105 days ago

Yeah my rate limits are getting exhausted way faster now. It's also way slower and overplans unless you steer it closely.

I can’t rely on this anymore.

sanderjd 104 days ago

Which open weights models did you use for this comparison, and how are you running them?

mattmanser 105 days ago

I just don't believe you.

The vast gulf between open weights and frontier models that existed 6 months ago has suddenly disappeared?

It's far more likely you're just bad at assessing model output.

jamiejquinn 105 days ago

Or that gulf doesn't exist for the problems they are trying to solve?

michaelscott 105 days ago

Their problem space may be just fine with open weight models regardless, but yes, the releases of gemma 4, GLM 5.1 and qwen 3.5 (and now 3.6!) have all happened in the last 6 months

weird-eye-issue 105 days ago

> Why pay 200$ to randomly get rug-pulled with no warning, when I can pay 20$ for 90% of the intelligence with reliable and higher performance?

Then go do that. Good luck!