Hacker News new | ask | show | jobs
by CuriouslyC 245 days ago
GPT5 may have been underwhelming to _you_. Understand that they're heavily RLing to raise the floor on these models, so they might not be magically smarter across the board, there are a LOT of areas where they're a lot better that you've probably missed because they're not your use case.
1 comments

every time i say "the tech seems to be stagnating" or "this model seems worse" based on my observations i get this response. "well, it's better for other use cases." i have even heard people say "this is worse for the things i use it for, but i know it's better for things i don't use it for."

i have yet to hear anyone seriously explain to me a single real-world thing that GPT5 is better at with any sort of evidence (or even anecdote!) i've seen benchmarks! but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.

the few cases i have heard that venture near that ask may be moderately intriguing, but don't seem to justify the overall cost of building and running the model, even if there have been marginal or perhaps even impressive leaps in very narrow use cases. one of the core features of LLMs is they are allegedly general-purpose. i don't know that i really believe a company is worth billions if they take their flagship product that can write sentences, generate a plan, follow instructions and do math and they are constantly making it moderately better at writing sentences, or following instructions, or coming up with a plan and it consequently forgets how to do math, or becomes belligerent, or sycophantic, or what have you.

to me, as a user with a broad range of use cases (internet search, text manipulation, deep research, writing code) i haven't seen many meaningful increases in quality of task execution in a very, very long time. this tracks with my understanding of transformer models, as they don't work in a way that suggests to me that they COULD be good at executing tasks. this is why i'm always so skeptical of people saying "the big breakthrough is coming." transformer models seem self-limiting by merit of how they are designed. there are features of thought they simply lack, and while i accept there's probably nobody who fully understands how they work, i also think at this point we can safely say there is no superintelligence in there to eke out and we're at the margins of their performance.

the entire pitch behind GPT and OpenAI in general is that these are broadly applicable, dare-i-say near-AGI models that can be used by every human as an assistant to solve all their problems and can be prompted with simple, natural language english. if they can only be good at a few things at a time and require extensive prompt engineering to bully into consistent behavior, we've just created a non-deterministic programming language, a thing precisely nobody wants.

The simple explanation for all this, along with the milquetoast replies kasey_junk gave you, is that to its acolytes, AI and LLMs cannot fail, only be failed.

If it doesn't seem to work very well, it's because you're obviously prompting it wrong.

If it doesn't boost your productivity, either you're the problem yourself, or, again, you're obviously using it wrong.

If progress in LLMs seems to be stagnating, you're obviously not part of the use cases where progress is booming.

When you have presupposed that LLMs and this particular AI boom is definitely the future, all comments to the contrary are by definition incorrect. If you treat it as a given that this AI boom will succeed (by some vague metric of "success") and conquer the world, skepticism is basically a moral failing and anti-progress.

The exciting part about this belief system is how little you actually have to point to hard numbers and, indeed, rely on faith. You can just entirely vibe it. It FEELS better and more powerful to you, your spins on the LLM slot machine FEEL smarter and more usable, it FEELS like you're getting more done. It doesn't matter if those things are actually true over the long run, it's about the feels. If someone isn't sharing your vibes about the LLM slot machine, that's entirely their fault and problem.

And on the other side, to detractors, AI and LLMs cannot ever succeed. There's always another goalpost to shift.

If it seems to work well, it's because it's copying training data. Or it sometimes gets something wrong, so it's unreliable.

If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

If they point to improvements in benchmarks, it's because model vendors are training to the tests, or the benchmarks don't really measure real-world performance.

If the improvements are in complex operations where there aren't benchmarks, their reports are too vague and anecdotal.

The exciting part about this belief system is how little you have to investigate the actual products, and indeed, you can simply rely on a small set of canned responses. You can just entirely dismiss reports of success and progress; that's completely due to the reporter's incompetence and self-delusion.

I work in a company that's "all in on AI" and there's so much BS being blown up just because they can't have it fail because all the top dogs will have mud on their faces. They're literally just faking it. Just making up numbers, using biased surveys, making sure employees know it's being "appreciated" if they choose option A "Yes AI makes me so much more productive" etc.

This is definitely something that biases me against AI, sure. Seeing how the sausage is made doesn't help. Because it's really a lot of offal right now especially where I work.

I'm a very anti-corporate non-teamplayer kinda person so I tend to be highly critical, I'll never just go along with PR if it's actually false. I won't support my 'team' if it's just wrong. Which often rubs people the wrong way at work. Like when I emphasised in a training that AI results must be double checked. Or when I answered in an "anonymous" survey that I'd rather have a free lunch than "copilot" and rated it a 2 out of 5 in terms of added value (I mean, at the time it didn't even work in some apps)

But I'm kinda done with soul-killing corporatism anyway. Just waiting for some good redundancy packages when the AI bubble collapses :)

> If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

A pretty substantial number of developers are doing trivial edits to business applications all over the globe, pretty much continuously. At least in the low to mid double digits %

wouldn't call myself a detractor. i wouldn't call it a belief system i hold (i am an engineer 20 years into my career and would love to automate away the tedious parts of my job i've done a thousand times) as it is a position i hold based on the evidence i've seen in front of me.

i constantly hear that companies are running with "50% of their code written by AI!" but i've yet to meet an engineer who says they've personally seen this. i've met a few who say they see it through internal reporting, though it's not the case on their team. this is me personally! i'm not saying these people don't exist. i've heard it much more from senior leadership types i've met in the field - directors, vps, c-suite, so on.

i constantly hear that AI can do x, y, or z, but no matter how many people i talk to or how much i or my team works towards those goals, it doesn't really materialize. i can accept that i may be too stupid (though i'd argue that if that's the problem, the AI isn't as good as claimed) but i work with some brilliant people and if they can't see results, that means something to me.

i see people deploying the tool at my workplace, and recently had to deal with a situation where leadership was wondering why one of our top performers had slowed down substantially and gotten worse, only to find that the timeline exactly aligned with them switching to cursor as their IDE.

i read papers - lots of papers - and articles about both positive and negative assertions about LLMs and their applicability in the field. i don't feel like i've seen compelling evidence in research not done by the foundation model companies that supports the theory this is working well. i've seen lots of very valid and concerning discoveries reported by the foundation model companies, themselves!

there are many places in the world i am a hardliner on no generative AI and i'll be open about that - i don't want it in entertainment, certainly not in music, and god help me if i pick up the phone and call a company and an agent picks up.

for my job? i'm very open to it. i know the value i provide above what the technology could theoretically provide, i've written enough boilerplate and the same algorithms and approaches for years to prove to myself i can do it. if i can be as productive with less work, or more productive with the same work? bring it on. i am not worried about it taking my job. i would love it to fulfill its promise.

i will say, however, that it is starting to feel telling that when i lay out any sort of reasoned thought on the issue that (hopefully) exposes my assumptions, biases, and experiences, i largely get vague, vibes-based answers, unsourced statistics, and responses that heavily carry the implication that i'm unwilling to be convinced or being dogmatic. i very rarely get thoughtful responses, or actual engagement with the issues, concerns, or patterns i write about. oftentimes refutations of my concerns or issues with the tech are framed as an attack on my willingness to use or accept it, rather than a discussion of the technology on its merits.

while that isn't everything, i think it says something about the current state of discussion around the technology.

You really thought you had a post with this one huh. I have second-hand embarrassment for you.
Claude Sonnet 4.5 is _way_ better than previous sonnets and as good as Opus for the coding and research tasks I do daily.

I rarely use Google search anymore, both because llms got that ability embedded and the chatbots are good at looking through the swill search results have become.

"it's better at coding" is not useful information, sorry. i'd love to hear tangible ways it's actually better. does it still succumb to coding itself in circles, taking multiple dependencies to accomplish the same task, applying inconsistent, outdated, or non-idiomatic patterns for your codebase? has compliance with claude.md files and the like actually improved? what is the round trip time like on these improvements - do you have to have a long conversation to arrive at a simple result? does it still talk itself into loops where it keeps solving and unsolving the same problems? when you ask it to work through a complex refactor, does it still just randomly give up somewhere in the middle and decide there's nothing left to do? does it still sometimes attempt to run processes that aren't self-terminating to monitor their output and hang for upwards of ten minutes?

my experience with claude and its ilk are that they are insanely impressive in greenfield projects and collapse in legacy codebases quickly. they can be a force multiplier in the hands of someone who actually knows what they're doing, i think, but the evidence of that even is pretty shaky: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

the pitch that "if i describe the task perfectly in absolute detail it will accomplish it correctly 80% of the time" doesn't appeal to me as a particularly compelling justification for the level of investment we're seeing. actually writing the code is the simplest part of my job. if i've done all the thinking already, i can just write the code. there's very little need for me to then filter that through a computer with an overly-verbose description of what i want.

as for your search results issue: i don't entirely disagree that google is unusable, but having switched to kagi... again, i'm not sure the order of magnitude of complexity of searching via an LLM is justified? maybe i'm just old, but i like a list of documents presented without much editorializing. google has been a user-hostile product for a long time, and its particularly recent quality collapse has been well-documented, but this seems a lot more a story of "a tool we relied on has gotten measurably worse" and not a story of "this tool is meaningfully better at accomplishing the same task." i'll hand it to chatgpt/claude that they are about as effective as google was at directing me to the right thing circa a decade ago, when it was still a functional product - but that brings me back to the point that "man, this is a lot of investment and expense to arrive at the same result way more indirectly."

You asked for a single anecdote of llms getting better at daily tasks. I provided two. You dismissed them as not valuable _to you_.

It’s fine that your preferences aren’t aligned such that you don’t value the model or improvements that we’ve seen. It’s troubling that you use that to suggest there haven’t been improvements.

you didn't provide an anecdote. you just said "it's better." an anecdote would be "claude 4 failed in x way, and claude 4.5 succeeds consistently." "it is better" is a statement of fact with literally no support.

the entire thrust of my statement was "i only hear nonspecific, vague vibes that it's better with literally no information to support that concept" and you replied with two nonspecific, vague vibes. sorry i don't find that compelling.

"troubling" is a wild word to use in this scenario.

My one shot rate for unattended prompts (triggered GitHub actions) has gone from about 2 in 3 to about 4 in 5 with my upgrade to 4.5 in the codebase I program in the most (one built largely pre-ai). These are highly biased to tasks I expect ai to do well.

Since the upgrade I don’t use opus at all for planning and design tasks. Anecdotally, I get the same level of performance on those because I can choose the model and I don’t choose opus. Sonnet is dramatically cheaper.

What’s troubling is that you made a big deal about not hearing any stories of improvements as if your bar was very low for said stories, then immediately raised the bar when given them. It means that one doesn’t know what level of data you actually want.

Here's one. I have a head to head "benchmark" involving generating a React web app to display a Gantt chart, add tasks, layer overlaps, read and write to files, etc. I compared implementing this application using both Claude Code with Opus 4.1 / Sonnet 4 (scenario 1) and Claude Code 2 with Sonnet 4.5 (scenario 2) head to head.

The scenario 1 setup could complete the application but it had about 3 major and 3 minor implementation problems. Four of those were easily fixed by pointing them out, but two required significant back and forth with the model to resolve.

The scenario 2 setup completed the application and there were four minor issues, all of which were resolved with one or two corrective prompts.

Toy program, single run through, common cases, stochastic parrot, yadda yadda, but the difference was noticeable in this direct comparison and in other work I've done with the model I see a similar improvement.

Take from that what you will.

The biggest issue with Sonnet 4.5 is that it's chatty as fuuuck. It just won't shut up, it keeps producing massive markdown "reports" and "summaries" of every single minor change, wasting precious context.

With Sonnet 4 I rarely ran out of quota unexpectedly, but 4.5 chews through whatever little Anthropic gives us weekly.

Gpt5 isn't an improvement to me, but Claude sonnet4.5, handle terragrunt way, way better than the previous version did. It also go search AWS documentation by itself, and parse external documents way better. That's not LLM improvement, to be clear (except the terragrunt thing), I think it's improvement in data acquisition and a better inference engine. On react project it seems way, way less messy also, I have to use it more but the inference engine seems clearer. At least less prone to circular code, where it's stuck in a loop. It seems to be exiting the loop faster, even when the output isn't satisfactory (which isn't an issue to me, most of my prompt have more or less 'only write functions template, do not write the inside logic if it has to contain more than a loop', I fill the blanks myself)