| HN Mirror



	by alex7o 61 days ago
	Ok, I find it funny that people compare models and say Opus 4.7 is SOTA and much better, etc. I have used GLM 5.1 (I assume this comes from them training on both Opus and Codex) for things Opus couldn't do and have seen it write better code. I haven't tried the Qwen Max series, but I have seen the local 122B model do smarter, more correct things based on docs than Opus. So yes, benchmarks are one thing, but reality is what the models actually do, and you should learn the real strengths each model possesses. It is a tool in the end: you shouldn't say a hammer is better than a wrench even though both can drive a nail into a piece of wood.

17 comments

mikenew 60 days ago

GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.

operatingthetan 60 days ago

It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.

fwipsy 60 days ago

Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

easygenes 60 days ago

Unless you're looking at something like a pass@100 benchmark, the benchmarks are confounded heavily by a likelihood of a "golden path" retrieval within their capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (heavy list of relevant complex instructions can weigh on capabilities in different ways than e.g. having a history where there's prior unrelated tasks still in context).

The best tests are your own custom personal-task-relevant standardized tests (which the best models can't saturate, so aiming for less than 70% pass rate in the best case).

All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.

make3 59 days ago

The pass@100 is such a weird critique angle that is surprisingly mainstream; guess what, no one cares if the correct answer is in the top 100, it needs to be the top 1. A model with a better answer in the top 1 is a better model, full stop.
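
For readers unfamiliar with the metric being argued about, here is a minimal sketch of pass@k, assuming the commonly used unbiased estimator computed from n sampled attempts of which c passed. The function name and the example numbers are illustrative, not taken from any benchmark mentioned in this thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n total samples of which c were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k samples contains a success.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model that solves a task in 5 of 100 attempts.
print(round(pass_at_k(100, 5, 1), 2))    # 0.05
print(round(pass_at_k(100, 5, 100), 2))  # 1.0
```

The same hypothetical model looks perfect at pass@100 and poor at pass@1, which is the gap both comments above are pointing at from different sides.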

mentalgear 60 days ago

This. Plus if you want to even attempt measuring real 'intelligence' you want to run a neuro-symbolic, de-lexicalized benchmark (e.g. DL-ReasonSuite, SoLT, GSM-Symbolic) - which none of the providers releasing new models showcase.

operatingthetan 60 days ago

>just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.

You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.

Ladioss 60 days ago

SOTA models war is the new console war.

But more seriously, I can't help but be amused by how emotionally invested in their AI brand of choice people are getting.

ulfw 60 days ago

AI is a complete commodity

One model can replace another at any given moment in time.

It's NOT a winner-takes-all industry

and hence none of the lofty valuations make sense.

the AI bubble burst will be epic and make us all poorer. Yay

StilesCrisis 60 days ago

Staying power is probably the most important factor, which is why I'm thinking Google eventually takes the crown.

api 60 days ago

They might be converging somewhat. The ultimate limiting factor is training data. Eventually I think they will converge and then the competition will be on memory and compute efficiency, with the best being the smallest maximally capable model.

hamdingers 60 days ago

And the subjectivity is bidirectional.

People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.

scotty79 60 days ago

I had one occasion where GLM 5.1 did about 95% of the implementation that I needed but couldn't progress from there. And Codex (free quota) solved the remaining 5% on the spot. I'm super happy with both. I don't touch anything Anthropic with a 10 foot pole.

DeathArrow 60 days ago

>GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.

GLM 5.1 is pretty good but there are some "buts".

They hiked the prices 2 times this year. I subscribed to the pro coding plan just before the last hike. At the start of the year, they had only 5 hours quota and no weekly quota. And I hit the weekly quota hard. I can't upgrade the subscription to get a higher weekly quota because they jacked up the prices a lot recently.

My $30 subscription now costs $72. Previously it was $15. Max was $49, then $80, and now $160.
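
To put those hikes in proportion, a quick sketch of the arithmetic. It assumes the Pro plan went $15, then $30, then $72, which is one reading of the comment above; the function name is illustrative.

```python
def hike_summary(prices: list[float]) -> tuple[list[float], float]:
    """Return the multiplier of each price step and the overall multiplier."""
    steps = [later / earlier for earlier, later in zip(prices, prices[1:])]
    overall = prices[-1] / prices[0]
    return steps, overall


# Price sequences as quoted in the comment (Pro order is an assumption).
pro_steps, pro_overall = hike_summary([15, 30, 72])
max_steps, max_overall = hike_summary([49, 80, 160])

print(pro_steps, round(pro_overall, 2))
print([round(s, 2) for s in max_steps], round(max_overall, 2))
```

Under that reading, Pro ends up at 4.8 times its original price and Max at roughly 3.3 times.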

_blk 60 days ago

What hardware do you run it on? Trying to consider the cost of subscription + API vs new HW..

_s_a_m_ 58 days ago

I used GLM 5.1 and it was bad, I have no clue why people claim it is good

LoganDark 60 days ago

The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.

deaux 60 days ago

> The value in Claude Code is its harness

If this was the case then Anthropic would be in a very bad spot.

It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.

Pi is better than CC as a harness in almost every respect.

enochthered 60 days ago

Anthropic limiting Claude subs to Claude code is what pushed me away in the end because I wanted to keep using Pi.

strel0k1 60 days ago

Just sign up for an AWS account and use the Anthropic models through Bedrock which Pi can use.

seunosewa 60 days ago

API costs are really high compared to subs.

adrianN 60 days ago

Why use tricks to support a company that is hostile to your use case?

deaux 60 days ago

What advantage are you saying this has compared to just directly going through the Anthropic provider? They are the same price.

bizzletk 60 days ago

Can you enumerate why?

deaux 60 days ago

- Claude Code has repeatedly had enormous token wastage bugs. Its agent interactions are also inefficient. These are the cause of many of the reports of "single prompt blew through 5-hour quota" even though it's a reasonable prompt.

- It still lacks support for industry standards such as AGENTS.md

- Extremely limited customization

- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.

- Obvious one: can't easily switch between Claude and non-Claude models

- Resource usage

More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.

Mashimo 60 days ago

I thought the desktop app used the cli app in the background?

vidarh 60 days ago

I feel like it's Sonnet level for implementation, but not matching up to Opus for planning.

But I agree it's close enough that it's worth using heavily. I've not cancelled my Claude Max subscription, but I've added a z.ai subscription...

alfonsodev 60 days ago

My combo is Codex and a Claude basic subscription for planning the hard tasks (if any), and opencode with GLM 5.1 (z.ai coding plan) for the actual coding.

opencode is awesome, I don't miss the Claude or Codex CLI at all, and the z.ai plan is way more generous in comparison.

I was lucky to subscribe to the z.ai coding plan pro when it cost $30/month; I was surprised that it now costs $70/month.

In case anyone wants to subscribe to z.ai with 10% discount [1] * here is the credit campaign rules * [2]

- [1] https://z.ai/subscribe?ic=MW6H74HAZ0

- [2] https://docs.z.ai/devpack/credit-campaign-rules

mettamage 60 days ago

Hmm

Will try it out. Thanks for sharing!

abustamam 60 days ago

What is your workflow? Do you use Cursor or another tool for code Gen?

mikenew 60 days ago

I use Opencode, both directly and through Discord via a little bridge called Kimaki.

https://github.com/remorses/kimaki

bink-lynch 60 days ago

I have been using GLM-5.1 with pi.dev through Ollama Cloud for my personal projects and I am very happy with this setup. I use pi.dev with Claude Sonnet/Opus 4.6 at work. Claude Code is great but the latest update has me compacting so much more frequently I could not stand it. I don't miss MCP tool calling when I am using pi.dev; it uses APIs just fine. I actually think GLM-5.1 builds better websites than Claude Opus. For my personal projects I am building a full stack development platform and GLM-5.1 is doing a fantastic job.

zackify 60 days ago

I'm using pi the same as you. However, I have an MCP I need to use and the popular extension for that support works fine for me.

Really liking pi and glm 5.1!

jadbox 60 days ago

Why use ollama cloud versus like Openrouter?

bink-lynch 60 days ago

The limits seem higher on Ollama Cloud to me than paying for API access. I don't have solid stats on that though. I have an OpenRouter account and the service I am creating is going to need to use that. I will have a better measuring stick then.

zackify 60 days ago

Recently it had great limits but this month I'm trying open router directly.

jxmesth 61 days ago

The only reason I'm stuck with Claude and Chatgpt is because of their tool calling. They do have some pretty useful features like skills etc. I've tried using qwen and deepseek but they can't even output documents. How are you guys handling documents and excels with these tools? I'd love to switch tbh.

embedding-shape 61 days ago

> I've tried using qwen and deepseek but they can't even output documents

What agent harness did you use? Usually, "write_file", "shell_exec" or similar are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm unsure if you could even call it an agent harness in the first place.
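
As a rough illustration of what those four tools amount to, here is a minimal sketch of a tool table and dispatcher. The tool names come from the comment above; the `dispatch` function, the shape of the tool-call dict, and the lack of any sandboxing are assumptions made for illustration, not how any particular harness does it.

```python
import subprocess
from pathlib import Path


def read_file(path: str) -> str:
    """Return the text content of a file."""
    return Path(path).read_text(encoding="utf-8")


def list_files(path: str = ".") -> str:
    """Return the entry names in a directory, one per line."""
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))


def write_file(path: str, content: str) -> str:
    """Write text to a file and report how much was written."""
    Path(path).write_text(content, encoding="utf-8")
    return f"wrote {len(content)} characters to {path}"


def shell_exec(command: str, timeout: int = 30) -> str:
    """Run a shell command and return combined stdout and stderr.
    No sandboxing here: run this only inside a disposable environment."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


TOOLS = {
    "read_file": read_file,
    "list_files": list_files,
    "write_file": write_file,
    "shell_exec": shell_exec,
}


def dispatch(tool_call: dict) -> str:
    """Route a hypothetical {"name": ..., "arguments": {...}} call to a tool."""
    name = tool_call["name"]
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**tool_call.get("arguments", {}))
```

A real harness wraps this in a loop that sends each tool result back to the model; the table itself is most of what separates a chat UI from something that can write a document to disk.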

jxmesth 61 days ago

Sorry for the confusion, I was actually talking about their Web based chat. Since most of my work is governance and docs, I just use their Web chats and they just refuse to output proper documents like Claude or Chatgpt do.

embedding-shape 61 days ago

Aha... Well, I let Codex (Claude Code would work too) manage/troubleshoot .xlsx files too, seems to handle it just fine (it tends to un-archive them and browse the resulting XML files without issues), seen it do similar stuff for .app and .docx files too so maybe give that a try with other harnesses/models too, they might get it :)
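
The "un-archive and browse the XML" trick works because .xlsx and .docx files are ZIP archives of XML parts. A minimal sketch with Python's standard `zipfile` module follows; the helper names are made up, and the demo archive is a stand-in with conventional part names, not a valid workbook.

```python
import io
import zipfile


def list_ooxml_parts(source) -> list[str]:
    """List the member paths inside an .xlsx/.docx file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_ooxml_part(source, part: str) -> str:
    """Return one XML part of the archive as text."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(part).decode("utf-8")


def make_demo_archive() -> io.BytesIO:
    """Build an in-memory stand-in archive so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    buffer.seek(0)
    return buffer


demo = make_demo_archive()
print(list_ooxml_parts(demo))
print(read_ooxml_part(demo, "xl/workbook.xml"))
```

With a real spreadsheet, pass its path instead of the demo buffer.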

jxmesth 60 days ago

Yeah, it's just way easier to do via the web/mobile app but I'll give using it via the CLI a try. Thanks :)

make3 59 days ago

there's things like Open Web UI that allow you to easily get a chat UI from an open source model

noduerme 61 days ago

You're not giving an AI command line access to your work computer? How do you expect to keep up? /s

dymk 61 days ago

You give it command line access in a VM...

noduerme 60 days ago

Yeah, fine... but it's like daily that a non-tech-savvy friend of mine tells me they just installed some shiny "harness" on their laptop now to organize their emails, and they "just put it in one folder" and "8n8 says", what does it say on the tin, Dave? "it says it's highly unlikely it will escape from the folder". Your work computer? "Yeah, but it's a real company. They're all about security."

So telling someone who just wants to upload an .xlsx file to a bot that they should just find a harness to give CLI access to their work computer - right after they say they work in a regulatory capacity - is just freakin malpractice.

ycui1986 60 days ago

i give it access in real ubuntu, no vm, no docker. so long as I don't ask it to organize files, it will behave. it has not screwed me so far.

koen_hendriks 61 days ago

You mean a VM like the one that contains a 0day that can escape the sandbox that gets found every year at pwn2own?

chillfox 60 days ago

You can make a harness fully functional with just the "shell_exec" tool if you give it access to a linux/unix environment + playwright cli.

ecocentrik 61 days ago

When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.

jxmesth 61 days ago

I gave it a try a few weeks ago tbh, I'll give it another shot tho. I mainly use their Web chats since that's easier to use and previously, qwen, deepseek, kimi, all were unable to output proper docx files or use skills.

ecocentrik 61 days ago

Try loading the models up in a coding harness like Claude Code. There's a few docx skills listed on Vercel's skill index.

https://skills.sh/tfriedel/claude-office-skills/docx

ycui1986 60 days ago

outputting docx files does not have much to do with model capability. it is about whether tool calling has been configured.

estimator7292 61 days ago

You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.

zrn900 60 days ago

You can just use Cline in VSCode to get most of the tooling you need - it works with all models. Including Xiaomi's new Mimo with 1m context window and blazing fast speed. It's much cheaper than Claude's biggest plan and with much, much more quota.

sscaryterry 61 days ago

You can use GLM-5.1 with claude code directly, I use ccs, GLM-5.1 setup as plan, but goes via API key.

NobleLie 60 days ago

Yep Claude Code CLI does A LOT (which is now confirmed even more)

ycui1986 60 days ago

qwen3.5 and qwen3.6 are both good at tool calling.

jwitthuhn 61 days ago

I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model) which is a fork of gemini-cli and the tool use with Qwen models at least has been great.

Moosdijk 61 days ago

I wonder why glm is viewed so positively.

Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.

pkulak 61 days ago

I've been running Opus and GLM side-by side for a couple weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services, I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.

The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.

tasuki 61 days ago

> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.

Yes, but... isn't the same true for Opus and all the other models too?

slopinthebag 61 days ago

Opus is about 7 times more expensive than GLM with API pricing. And since you can only use the Opus subscription plan in CC, you're essentially locked into API pricing for Pi and any other harness.

So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.

tasuki 61 days ago

Perhaps I'm being extremely daft: If the API is 7 times more expensive, then why is it $1000 vs $30? Or is there a GLM subscription one can use with Pi? Certainly not available in my (arguably outdated) Pi.

RussianCow 61 days ago

I'm not the OP, but it's the latter. I'm currently using the "Lite" GLM subscription with OpenCode, for example. I'm not using it very heavily, but I haven't come close to hitting the limits, whereas I burned through my weekly limits with Claude very regularly.

bink-lynch 60 days ago

I am using GLM-5.1 in pi.dev through Ollama Cloud. I am able to get by on the $20 plan. I use it a lot and the reset is hourly for sessions and weekly overall. This is the first week I got close to the limit before reset at about 85% used. I am probably using it about 4 hours a day on average 6 or 7 days per week.

girvo 61 days ago

You can use GLM’s coding plan in Pi, just use the anthropic API instead of the OpenAI compatible one they give.

Mashimo 61 days ago

I have used GLM 4.7, 5 and 5.1 now for about 3 months via the OpenCode harness and I don't remember it ever being stuck in a loop.

You have to keep it below ~100 000 tokens, else it gets funny in the head.

I only use it for hobby projects though. Paid 3 EUR per month, that is no longer available though :( Not sure what I will choose end of month. Maybe OpenCode Go.

Mashimo 60 days ago

EDIT: Ok, now I tried GLM for the first time in the morning CET, and it was .. bad. The reasoning took 5 minutes for a very very small .html file, going around in circles.

Evening CET experience for me is super smooth.

gck1 61 days ago

That's unfortunate. 70-80k tokens is roughly the point where I start wrapping up with giving agent required context even on the small to medium sized requests.

That would leave almost no tokens for actual work
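
The squeeze is easy to see with numbers: a ~100 000 token ceiling minus 70-80k of up-front context leaves roughly 20-30k for the work itself. One common workaround is to drop the oldest history until the conversation fits a budget. A minimal sketch follows, assuming a crude four-characters-per-token estimate (a real tokenizer will differ) and hypothetical function names.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit within budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so recent context survives.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
print(len(trim_history(history, budget=250)))  # 2
```

Dropping old messages loses information, so harnesses usually summarize ("compact") instead; the budget check is the same either way.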

chillfox 60 days ago

GLM is the first open source model that actually worked for me, where I found the output ok.

And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.

Akira1364 61 days ago

IDK about GLM but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension, I see no actual reason Opus should consume 3x more quota than it the way it does

spaceman_2020 61 days ago

I think it offers a very good tradeoff of cost vs competency

4.7 is better, but it's also wildly expensive

slopinthebag 61 days ago

You're probably just holding it wrong.

ternaryoperator 61 days ago

The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.

ezekiel68 61 days ago

Qwen3-Coder produced much better rust code (that utilized rust's x86-64 vectorized extensions) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and trae CLI.

I was very impressed.

gck1 61 days ago

I think Claude, in general, writes very lazy, poor quality code, but it writes code that works in fewer iterations. This could be one of the reasons behind its popularity: it pushes towards the end faster at all costs.

Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.

lambda 60 days ago

Their latest, Qwen3.6 35B-A3B is quite capable, and fast and small enough I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B just feel a bit too slow, or OOM my system too often, or run up on cache limits so spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong, and lightweight enough that it feels usable on consumer hardware.

justincormack 61 days ago

Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.

sirnicolaz 61 days ago

Consider that SWE benchmarking is mainly done with python code. It tells something

blurbleblurble 60 days ago

Opus 4.6 was incredible but Opus 4.7 is genuinely frustrating to me so far. It's really sharp but can be so lazy. It's constantly telling me that we should save this for tomorrow, that it's time for bed (in the middle of the day), and very often quite sloppy and bold in its action. These adjustments are getting old. The next crop of open models seems ready to practically replace the big ones as sharp orchestrator agents.

make3 59 days ago

I had to write multiple times in my prompt that it's not the model's role to change the subject or end the conversation at all.

I think that they do that to dodge conversations about controversial subjects without full-on refusing to answer. They'll give you an ok answer then tell you to go to get the walk you were talking about.

I also feel like maybe they think people are still ready to pay a lot if they feel like they're getting a lot of "high value stuff", even if the model refuses to do the low value stuff, so they basically try to stop you from doing low value stuff on Opus. I suspect that Sonnet or Haiku never tells you to go take a hike.

chillfox 60 days ago

I have never seen a model be “lazy” before (I have seen them go for minimal change). I have been using the models through the api with various agents and no custom system prompt.

So I am curious, how do people get these lazy outputs?

Is it by having one of those custom system prompts that basically tells the model to be disrespectful?

Or is it free tier?

Cheap plans?

enraged_camel 60 days ago

I have seen some people complain about a new tendency where it can suggest wrapping up the current task even though it isn't done yet. I haven't seen it myself though.

solenoid0937 60 days ago

Usually this gets worse if you have a phrase like "wrap it up" earlier in the output, or if you're at a few hundred thousand tokens without compacting.

In both cases the fix is really simple, just compact.

cornedor 61 days ago

I tried GLM and Qwen for a day last week. Some issues they could solve, while one task, relatively easy on the surface, they just could not solve after a few tries, and Opus one-shotted it this morning with the same prompt. It’s a single example of course, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. On the other hand, GLM did one-shot a PhpStorm plugin.

dev_l1x_be 61 days ago

Do you use Opus through the API or with subscription? Did you use OpenCode or Code?

cornedor 61 days ago

Opus through Claude Code, the Chinese models through OpenCode Go, which seems like a great package to test them out.

odie5533 61 days ago

If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.

FlyingSnake 61 days ago

I tried GLM5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of 5H credit limit faster than Claude.

bensyverson 61 days ago

If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"

nothinkjustai 61 days ago

I see this with Opus too.

girvo 61 days ago

Indeed. And that’s with Anthropic hiding reasoning traces, unlike these other comparisons.

FlyingSnake 61 days ago

> "Actually…" or "But wait!"

You’re absolutely right!

Jokes apart, I did notice GLM doing these back and forth loops.

tonyarkles 61 days ago

I was watching Qwen3.6-35B-A3B (locally) doing the same dance yesterday. It eventually finished and had a reasonable answer, but it sure went back and forth on a bunch of things I had explicitly said not to do before coming to a conclusion. At least said conclusion was not any of the things I'd said not to do.

Lerc 61 days ago

That is essentially what the reasoning reinforcement training does. It is getting the model to say things that are more likely to result in the correct final answer. Everything it does in between doesn't necessarily need to be valid argument to produce the answer. You can think of it as filling the context with whatever is needed to make the right answer come out next. Valid arguments obviously help. but so might expressions of incorrect things that are not obviously untrue to the model until it sees them written out. The What's The Magic Word paper shows how far that could go. If the policy model managed to learn enough magic words it would be theoretically possible to end up with an LLM that spouts utter gibberish until delivering the correct answer seemingly out of the blue.

tonyarkles 61 days ago

That's pretty cool, thanks for the extra context! (pardon the... not even pun I guess)

Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normally see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.

nothinkjustai 61 days ago

Z.ai’s cloud offering is poor, try it with a different provider.

complexworld 60 days ago

could you add some context for why you think it's poor?

vidarh 60 days ago

I don't find GLM 5.1 beating Opus personally, but I do think it is good enough to consider it part of the SOTA pack at this point. It feels like it needs more time and tokens to achieve things, but that's okay - it's so much cheaper per token.

If Qwen3.6-Max is up there as well, it will be very interesting.

dev_l1x_be 61 days ago

Benchmarking is grossly misleading. Claude’s subscription with Code would not score this high on the benchmarks because of how they lobotomized agentic coding.

solomatov 61 days ago

>but I have seen the local 122b model do smarter more correct things based on docs than opus

Could you please share more about this

alex7o 61 days ago

Maybe a bit misleading. I have used it in two places.

One is for local opencode coding and config of stuff, the other is for agent-browser use, and for both it did better than Opus 4.6 for the thing I was testing at the moment. The problem with Opus at the time I tried it was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more; maybe turning thinking down on Opus would have helped me. Some people said that it is better to turn it off entirely when you start to implement code: it already knows what it needs to do and doesn't need more distraction.

Another example is my ghostty config: I learned from Qwen that it has theme support, while Opus would always just put the theme in the main file.

flyingsquirrel_ 60 days ago

GLM 5.1 is sometimes better than Opus 4.7. So it made me buy GLM coding plan.

mkhalil 60 days ago

Not to mention that Opus costs orders of magnitude more money. These are VERY impressive and usable.

FAANGS love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.

Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)

OtomotO 61 days ago

Many people have turned away from religion (which I can get behind), but have never removed the dogmatic thinking that lay at its root.

As so many things these days: It's a cult.

I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I've also tried to use it for GPU programming, where it absolutely sucks, with Sonnet, Opus 4.5 and 4.6

But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"

For me it's just a tool, so I shrug.

balls187 61 days ago

> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.

runarberg 61 days ago

I wonder about this. I see two obvious possibilities (if we ignore bias):

1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.

2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.

ehnto 61 days ago

I definitely find your last point is true for me. The more work I am doing with AI the more I am expecting it to do, similar to how you can expect more over time from a junior you are delegating to and training. However the model isn't learning or improving the same way, so your trust is quickly broken.

As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

tonyarkles 61 days ago

> However the model isn't learning or improving the same way, so your trust is quickly broken.

One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.

> similar to how you can expect more over time from a junior you are delegating to and training

That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.

> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.

svnt 61 days ago

Your version of the last point is a bit softer I think — parent was putting it down to “loss of talent” but yours captures the gaps vs natural human interaction patterns which seems more likely, especially on such short timescales.

runarberg 61 days ago

I confusingly say both. First I say that the ratio of work coming from the model is increasing, and when I am clarifying I say “your talent keeps deteriorating”. You correctly point out these are distinct, and maybe this distinction is important, although I personally don’t think so. The resulting code would be the same either way.

Personally I can see the case for both interpretations to be true at the same time, and maybe that is precisely why I confused them so eagerly in my initial post.

rescbr 61 days ago

I don’t think the providers intentionally nerf the models to make the new one look better. It’s a matter of them being stingy with infrastructure, whether by choice to increase profit or from a sheer lack of resources to keep n+1 models deployed in parallel without deprecating older ones when a new one is released.

I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.

flux3125 61 days ago

Point 2 is so true, I definitely find myself spending more time reading code vs writing it. LLMs can teach you a lot, but it's never the same as actually sitting down and doing it yourself.

e12e 61 days ago

I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).

Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).

But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.

But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes in that model.

Maybe it's poorly chosen variable names. A tendency to write plausible looking, plausibly named, e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, or missing use of transactions, in sequential code that appears sound - but ends up storing invalid data when one or several steps fail...
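To make the transaction case concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `accounts` table and the function names are invented for illustration. Both versions run the same two steps in sequence; only the second one wraps them in a transaction, so a failure in step two does not leave the debit from step one behind:

```python
import sqlite3

def make_db():
    # isolation_level=None puts the connection in autocommit mode,
    # so each statement is persisted immediately unless we BEGIN explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
    return conn

def debit_then_credit(conn, src, dst, amount):
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
    )
    cur = conn.execute(
        "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
    )
    if cur.rowcount == 0:  # step two failed: no such destination account
        raise LookupError(f"no such account: {dst}")

def transfer_unsafe(conn, src, dst, amount):
    # Looks sound, but the debit is already committed when the credit fails.
    debit_then_credit(conn, src, dst, amount)

def transfer_atomic(conn, src, dst, amount):
    conn.execute("BEGIN")
    try:
        debit_then_credit(conn, src, dst, amount)
    except Exception:
        conn.execute("ROLLBACK")  # undo the debit as well
        raise
    conn.execute("COMMIT")

def total(conn):
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

unsafe = make_db()
try:
    transfer_unsafe(unsafe, "a", "missing", 30)
except LookupError:
    pass
unsafe_total = total(unsafe)  # money vanished: the debit stuck

safe = make_db()
try:
    transfer_atomic(safe, "a", "missing", 30)
except LookupError:
    pass
safe_total = total(safe)  # balances unchanged

print(unsafe_total, safe_total)
```

The unsafe version passes any test that only exercises the happy path, which is the kind of hole that is easy to miss in review.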

In happy cases current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.

But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.

When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.

taurath 61 days ago

I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.

In the same way, it’s hard to see how people who say they’re struggling are actually using it.

There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.

balls187 61 days ago

Well summarized.

We're also seeing that the people up top are using this to cull the herd.

taneq 61 days ago

I wonder to what degree it depends on how easy you find coding in general. I find for the early steps genAI is great to get the ball rolling, but rapidly it becomes more work to explain what it did wrong and how to fix it (and repeat until it does so) than to just fix the code myself.

slopinthebag 60 days ago

Yes, this and also taste. What might be perfectly fine for one developer is an abomination for another who can spot the problems with it.

I think in every domain, the better you are the less useful you find AI.

psychoslave 61 days ago

What is it that is dogma free? If one goes hardcore pyrrhonism, doubting that there is anything currently doubting as this statement is processed somehow, that is perfectly sound.

At some point there is a need to have faith in some stable enough ground to be able to walk on.

Wolfbeta 61 days ago

Who controls that need for you?

ecshafer 61 days ago

All people think dogmatically. The only difference is what the ontological commitments and metaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. It's inescapable.

smallmancontrov 61 days ago

All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.

Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.

bensyverson 61 days ago

Allow me to introduce you to Buddhism

ecshafer 61 days ago

Elaborate. Buddhism is going to have the same epistemological issues as anything, since it's a human consciousness issue.

bensyverson 61 days ago

> since it's a human consciousness issue

I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.

tauroid 61 days ago

https://en.wikipedia.org/wiki/Prat%C4%ABtyasamutp%C4%81da

svnt 61 days ago

Which one?

bensyverson 61 days ago

Zen

svnt 61 days ago

The Western Zen? In my experience it is downgraded from being a religion to being a system of practice which relieves it of the broader Mahayana cosmology. But I would suggest the dogma is less obvious but still there, often just somewhere else, such as in its own limitations, or in a philosophical container at a higher level such as scientism.

OtomotO 61 days ago

Dogmatism is a spectrum and for too many people it's on the animal side of the scale.