| HN Mirror

	by mikenew 61 days ago
	GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all. Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.

9 comments

operatingthetan 61 days ago

It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.

fwipsy 61 days ago

Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

easygenes 61 days ago

Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (a heavy list of relevant, complex instructions can weigh on capabilities differently than, e.g., a history where prior unrelated tasks are still in context).

The best tests are your own custom, personal-task-relevant standardized tests (ones the best models can't saturate, so aim for a pass rate below 70% in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
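
To make the pass@1 vs pass@100 gap concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled on a task and c of them passed. The function name and the sample numbers are made up for illustration and are not taken from any benchmark mentioned in this thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c passed.

    Uses the unbiased estimator 1 - C(n - c, k) / C(n, k): the chance that
    a random subset of k attempts contains at least one passing attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 200 attempts at one task, 6 of them pass.
n, c = 200, 6
print(f"pass@1   = {pass_at_k(n, c, 1):.3f}")
print(f"pass@100 = {pass_at_k(n, c, 100):.3f}")
```

With these made-up numbers, a task that passes only 3% of the time on a single try still comes out close to 1 at k=100, which is why a pass@100 score and a day-to-day single-attempt experience can look like they describe two different models.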

make3 60 days ago

The pass@100 is such a weird critique angle that is surprisingly mainstream; guess what, no one cares if the correct answer is in the top 100, it needs to be the top 1. A model with a better answer in the top 1 is a better model, full stop.

mentalgear 60 days ago

This. Plus if you want to even attempt measuring real 'intelligence' you want to run a neuro-symbolic, de-lexicalized benchmark (e.g. DL-ReasonSuite, SoLT, GSM-Symbolic) - which none of the providers releasing new models showcase.

operatingthetan 61 days ago

>just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks; it's just that our ability to know which ones is currently impaired.

Ladioss 60 days ago

SOTA models war is the new console war.

But more seriously, I can't help but be amused by how emotionally invested in their AI brand of choice people are getting.

ulfw 61 days ago

AI is a complete commodity

One model can replace another at any given moment in time.

It's NOT a winner-takes-all industry

and hence none of the lofty valuations make sense.

the AI bubble burst will be epic and make us all poorer. Yay

StilesCrisis 60 days ago

Staying power is probably the most important factor, which is why I'm thinking Google eventually takes the crown.

api 60 days ago

They might be converging somewhat. The ultimate limiting factor is training data. Eventually I think they will converge and then the competition will be on memory and compute efficiency, with the best being the smallest maximally capable model.

hamdingers 61 days ago

And the subjectivity is bidirectional.

People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.

scotty79 60 days ago

I had one occasion where GLM 5.1 did about 95% of the implementation that I needed but couldn't progress from there. And Codex (free quota) solved the remaining 5% on the spot. I'm super happy with both. I don't touch anything Anthropic with a 10-foot pole.

DeathArrow 60 days ago

>GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

GLM 5.1 is pretty good but there are some "buts".

They hiked the prices twice this year. I subscribed to the pro coding plan just before the last hike. At the start of the year, they had only a 5-hour quota and no weekly quota. Now I hit the weekly quota hard, and I can't upgrade the subscription to get a higher weekly quota because they jacked up the prices a lot recently.

My $30 subscription now costs $72. Previously it was $15. Max was $49, then $80, and now $160.

_blk 60 days ago

What hardware do you run it on? Trying to consider the cost of subscription + API vs new HW..

_s_a_m_ 58 days ago

I used GLM 5.1 and it was bad, I have no clue why people claim it is good

LoganDark 61 days ago

The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.

deaux 61 days ago

> The value in Claude Code is its harness

If this was the case then Anthropic would be in a very bad spot.

It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.

Pi is better than CC as a harness in almost every respect.

enochthered 61 days ago

Anthropic limiting Claude subs to Claude Code is what pushed me away in the end, because I wanted to keep using Pi.

strel0k1 61 days ago

Just sign up for an AWS account and use the Anthropic models through Bedrock, which Pi can use.

seunosewa 61 days ago

API costs are really high compared to subs.

solenoid0937 60 days ago

Then you aren't the target market.

adrianN 61 days ago

Why use tricks to support a company that is hostile to your use case?

deaux 60 days ago

What advantage are you saying this has compared to just directly going through the Anthropic provider? They are the same price.

bizzletk 61 days ago

Can you enumerate why?

deaux 61 days ago

- Claude Code has repeatedly had enormous token wastage bugs. Its agent interactions are also inefficient. These are the cause of many of the reports of "single prompt blew through 5-hour quota" even though it's a reasonable prompt.

- It still lacks support for industry standards such as AGENTS.md

- Extremely limited customization

- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.

- Obvious one: can't easily switch between Claude and non-Claude models

- Resource usage

More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.

Mashimo 61 days ago

I thought the desktop app used the cli app in the background?

vidarh 60 days ago

I feel like it's Sonnet level for implementation, but not matching up to Opus for planning.

But I agree it's close enough that it's worth using heavily. I've not cancelled my Claude Max subscription, but I've added a z.ai subscription...

alfonsodev 60 days ago

My combo is codex and a claude basic subscription for planning the hard tasks (if any), and opencode with GLM 5.1 (z.ai coding plan) for the actual coding.

opencode is awesome. I don't miss the claude or codex CLI at all, and the z.ai plan is way more generous in comparison.

I was lucky to subscribe to the z.ai coding plan pro when it cost $30/month; I was surprised to see it now costs $70/month.

In case anyone wants to subscribe to z.ai with a 10% discount [1], here are the credit campaign rules [2].

- [1] https://z.ai/subscribe?ic=MW6H74HAZ0

- [2] https://docs.z.ai/devpack/credit-campaign-rules

mettamage 60 days ago

Hmm

Will try it out. Thanks for sharing!

abustamam 61 days ago

What is your workflow? Do you use Cursor or another tool for code gen?

mikenew 61 days ago

I use Opencode, both directly and through Discord via a little bridge called Kimaki.

https://github.com/remorses/kimaki
