Hacker News new | ask | show | jobs
by joshvince 241 days ago
> if you’re working on novel code, LLMs are absolutely horrible

This is spot on. Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs). But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.

6 comments

> at least for now.

I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or it's limit of capability.

My thinking here is that we already had the technologies of the LLMs and the compute, but we hadn't yet had the reason and capital to deploy it at this scale.

So the surprising innovation of transformers did not give us the boost in capability itself, it still needed scale. The marketing that enabled the capital, that enables that scale was what caused the insane growth, and capital can't grow forever, it needs returns.

Scale has been exponential, and we are hitting an insane amount of capital deployment for this one technology that, has yet to prove commercially viable at the scale of a paradigm shift.

Are businesses that are not AI based, actually seeing ROI on AI spend? That is really the only question that matters, because if that is false, the money and drive for the technology vanishes and the scale that enables it disappears too.

> Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or it's limit of capability.

To comment om this, because its the most common counter argument. Most technology has worked in steps. We take a step forward, then iterate on essentially the same thing. It's very rare we see order of magnitude improvement on the same fundamental "step".

Cars were quite a step forward from donkeys, but modern cars are not that far off from the first ones. Planes were an amazing invention, but the next model of plane is basically the same thing as the first one.

I agree, I think we are in the latter phase already. LLMs were a huge leap in machine learning, but everything after has been steps on top + scale.

I think we would need another leap to actually meet the markets expectations on AI. The market is expecting AGI, but I think we are probably just going to do incremental improvements for language and multi modal models from here, and not meet those expectations.

I think the market is relying on something that doesn't currently exist to become true, and that is a bit irrational.

Transformers aren't it, though. We need a new fundamental architecture and, just like every step forward in AI that came before, when that happens is a completely random event. Some researcher needs to wake up with a brilliant idea.

The explosion of compute and investment could mean that we have more researchers available for that event to happen, but at the same time transformers are sucking up all the air in the room.

Several people hinted at the limits this technology was about to face, including training data and compute. It was obvious it had serious limits.

Despite the warnings, companies insisted on marketing superintelligence nonsense and magic automatic developers. They convinced the market with disingenous demonstrations, which, again, were called out as bullshit by many people. They are still doing it. It's the same thing.

> Yes it advanced extremely quickly

The things that impress me about gpt-5 are basically the same ones that impressed me about gpt-3. For all the talk about exponential growth, I feel like we experienced one big technical leap forward and have spent the past 5 years fine-tuning the result—as if fiddling with it long enough will turn it into something it is not.

When building their LLMs, the model makers consumed the entire internet. This allowed the models to improve exponentially fast. But there's no more internet to consume. Yes, new data is being generated, but not at anywhere near the rate the models were growing in capability just a year ago. That's why we're seeing diminishing returns when comparing, say, GPT-5 to GPT-4.

The AI marketers, accelerationists and doomers may seem to be different from one another, but the one thing they have in common is their adherence to an extrapolationist fallacy. They've been treating the explosion of LLM capabilities as a promise of future growth and capability, when in fact it's all an illusion. Nothing achieves indefinite exponential growth. Everything hits a wall.

> Yes it advanced extremely quickly,

It did but it's kinda stagnated now especially on the LLM front. The time when ever week a groundbreaking model came out is over for now. Later revisions of existing models, like GPT5 and llama4 have been underwhelming.

GPT5 may have been underwhelming to _you_. Understand that they're heavily RLing to raise the floor on these models, so they might not be magically smarter across the board, there are a LOT of areas where they're a lot better that you've probably missed because they're not your use case.
every time i say "the tech seems to be stagnating" or "this model seems worse" based on my observations i get this response. "well, it's better for other use cases." i have even heard people say "this is worse for the things i use it for, but i know it's better for things i don't use it for."

i have yet to hear anyone seriously explain to me a single real-world thing that GPT5 is better at with any sort of evidence (or even anecdote!) i've seen benchmarks! but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.

the few cases i have heard that venture near that ask may be moderately intriguing, but don't seem to justify the overall cost of building and running the model, even if there have been marginal or perhaps even impressive leaps in very narrow use cases. one of the core features of LLMs is they are allegedly general-purpose. i don't know that i really believe a company is worth billions if they take their flagship product that can write sentences, generate a plan, follow instructions and do math and they are constantly making it moderately better at writing sentences, or following instructions, or coming up with a plan and it consequently forgets how to do math, or becomes belligerent, or sycophantic, or what have you.

to me, as a user with a broad range of use cases (internet search, text manipulation, deep research, writing code) i haven't seen many meaningful increases in quality of task execution in a very, very long time. this tracks with my understanding of transformer models, as they don't work in a way that suggests to me that they COULD be good at executing tasks. this is why i'm always so skeptical of people saying "the big breakthrough is coming." transformer models seem self-limiting by merit of how they are designed. there are features of thought they simply lack, and while i accept there's probably nobody who fully understands how they work, i also think at this point we can safely say there is no superintelligence in there to eke out and we're at the margins of their performance.

the entire pitch behind GPT and OpenAI in general is that these are broadly applicable, dare-i-say near-AGI models that can be used by every human as an assistant to solve all their problems and can be prompted with simple, natural language english. if they can only be good at a few things at a time and require extensive prompt engineering to bully into consistent behavior, we've just created a non-deterministic programming language, a thing precisely nobody wants.

The simple explanation for all this, along with the milquetoast replies kasey_junk gave you, is that to its acolytes, AI and LLMs cannot fail, only be failed.

If it doesn't seem to work very well, it's because you're obviously prompting it wrong.

If it doesn't boost your productivity, either you're the problem yourself, or, again, you're obviously using it wrong.

If progress in LLMs seems to be stagnating, you're obviously not part of the use cases where progress is booming.

When you have presupposed that LLMs and this particular AI boom is definitely the future, all comments to the contrary are by definition incorrect. If you treat it as a given that this AI boom will succeed (by some vague metric of "success") and conquer the world, skepticism is basically a moral failing and anti-progress.

The exciting part about this belief system is how little you actually have to point to hard numbers and, indeed, rely on faith. You can just entirely vibe it. It FEELS better and more powerful to you, your spins on the LLM slot machine FEEL smarter and more usable, it FEELS like you're getting more done. It doesn't matter if those things are actually true over the long run, it's about the feels. If someone isn't sharing your vibes about the LLM slot machine, that's entirely their fault and problem.

And on the other side, to detractors, AI and LLMs cannot ever succeed. There's always another goalpost to shift.

If it seems to work well, it's because it's copying training data. Or it sometimes gets something wrong, so it's unreliable.

If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

If they point to improvements in benchmarks, it's because model vendors are training to the tests, or the benchmarks don't really measure real-world performance.

If the improvements are in complex operations where there aren't benchmarks, their reports are too vague and anecdotal.

The exciting part about this belief system is how little you have to investigate the actual products, and indeed, you can simply rely on a small set of canned responses. You can just entirely dismiss reports of success and progress; that's completely due to the reporter's incompetence and self-delusion.

Claude Sonnet 4.5 is _way_ better than previous sonnets and as good as Opus for the coding and research tasks I do daily.

I rarely use Google search anymore, both because llms got that ability embedded and the chatbots are good at looking through the swill search results have become.

"it's better at coding" is not useful information, sorry. i'd love to hear tangible ways it's actually better. does it still succumb to coding itself in circles, taking multiple dependencies to accomplish the same task, applying inconsistent, outdated, or non-idiomatic patterns for your codebase? has compliance with claude.md files and the like actually improved? what is the round trip time like on these improvements - do you have to have a long conversation to arrive at a simple result? does it still talk itself into loops where it keeps solving and unsolving the same problems? when you ask it to work through a complex refactor, does it still just randomly give up somewhere in the middle and decide there's nothing left to do? does it still sometimes attempt to run processes that aren't self-terminating to monitor their output and hang for upwards of ten minutes?

my experience with claude and its ilk are that they are insanely impressive in greenfield projects and collapse in legacy codebases quickly. they can be a force multiplier in the hands of someone who actually knows what they're doing, i think, but the evidence of that even is pretty shaky: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

the pitch that "if i describe the task perfectly in absolute detail it will accomplish it correctly 80% of the time" doesn't appeal to me as a particularly compelling justification for the level of investment we're seeing. actually writing the code is the simplest part of my job. if i've done all the thinking already, i can just write the code. there's very little need for me to then filter that through a computer with an overly-verbose description of what i want.

as for your search results issue: i don't entirely disagree that google is unusable, but having switched to kagi... again, i'm not sure the order of magnitude of complexity of searching via an LLM is justified? maybe i'm just old, but i like a list of documents presented without much editorializing. google has been a user-hostile product for a long time, and its particularly recent quality collapse has been well-documented, but this seems a lot more a story of "a tool we relied on has gotten measurably worse" and not a story of "this tool is meaningfully better at accomplishing the same task." i'll hand it to chatgpt/claude that they are about as effective as google was at directing me to the right thing circa a decade ago, when it was still a functional product - but that brings me back to the point that "man, this is a lot of investment and expense to arrive at the same result way more indirectly."

The biggest issue with Sonnet 4.5 is that it's chatty as fuuuck. It just won't shut up, it keeps producing massive markdown "reports" and "summaries" of every single minor change, wasting precious context.

With Sonnet 4 I rarely ran out of quota unexpectedly, but 4.5 chews through whatever little Anthropic gives us weekly.

Gpt5 isn't an improvement to me, but Claude sonnet4.5, handle terragrunt way, way better than the previous version did. It also go search AWS documentation by itself, and parse external documents way better. That's not LLM improvement, to be clear (except the terragrunt thing), I think it's improvement in data acquisition and a better inference engine. On react project it seems way, way less messy also, I have to use it more but the inference engine seems clearer. At least less prone to circular code, where it's stuck in a loop. It seems to be exiting the loop faster, even when the output isn't satisfactory (which isn't an issue to me, most of my prompt have more or less 'only write functions template, do not write the inside logic if it has to contain more than a loop', I fill the blanks myself)
I’m curious what you are expecting when you say progress has stagnated?
>> The marketing that enabled the capital, that enables that scale was what caused the insane growth, and capital can't grow forever,

Striking parallels between AI and food delivery (uber eats, deliveroo, lieferando, etc.) ... burn capital for market share/penetration but only deliver someone else's product with no investment to understand the core market for the purpose of developing a better product.

> I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

??? It has already become exceptional. In 2.5 years (since chatgpt launched) we went from "oh, look how cute this is, it writes poems and the code almost looks like python" to "hey, this thing basically wrote a full programming language[1] with genz keywords, and it mostly works, still has some bugs".

I think the goalpost moving is at play here, and we quickly forget how 1 year makes a huge difference (last year you needed tons of glue and handwritten harnesses to do anything - see aider) and today you can give them a spec and get a mostly working project (albeit with some bugs), 50$ later.

[1] - https://github.com/ghuntley/cursed

I don't disagree with you on the technology, but mostly my comment is about what the market is expecting. With such a huge capex expenditure it is expecting a huge returns. Given AI has not proven consistent ROI generally for other enterprises (as far as I know), they are hoping for something better than what is right now and they are hoping for it to happen before the money runs out.

I am not saying it's impossible, but there is no evidence that the leap in technology to reach wild profitability (replacing general labour) such investment desires is just around the corner either.

After 3 years, I would like to see pathways.

Let say we found a company that already realized 5-10% of savings in the first step. Now, based on this we might be able to map out the path to 25-30% savings in 5% steps for example.

I personally haven’t seen this, but I might have missed it as well.

Three years? One year ago I tried using LLMs for coding and found it to be more trouble than it was worth, no benifit in time spent or effort made. It's only within the past several months that this gas changed, IMHO.
To phrase this another way, using old terms: We seem to be approaching the uncanny valley for LLMs, at which point the market overall will probably hit the trough of disillusionment.
It doesn't really matter what the market is expecting at this point, the president views AI supremacy as non-negotiable. AI is too big to fail.
It’s true, but not just the presidency. The whole political class is convinced that this is the path out of all their problems.
...Is it the whole political class?

Or is it the whole political party?

I am not from the US, but your administration could still fumble the AI bust even if it wants to avoid it. Who knows maybe they are hoping to short it.
That there is a bubble is absolutely certain. If for no other reason, than because investors don't understand the technology and don't know which companies are for real and which are essentially scams, they dump money into anything with the veneer of AI and hope some of it sticks. We're replaying the dotcom bubble, a lot of people are going to get burned, a lot of companies will turn out to be crap. But at the end of the dotcom crash we had some survivors standing above the rest and the whole internet thing turned out to have considerable staying power. I think the same will happen with AI, particularly agentic coding tools. The technology is real and will stick with us, even after the bubble and crash.
I feel like the invention of MCP was a lot more instrumental to that than model upgrades proper. But look at it as a good thing, if you will: it shows that even if models are plateauing, there's a lot of value to unlock through the tooling.
> it shows that even if models are plateauing,

The models aren't plateauing (see below).

> invention of MCP was a lot more instrumental [...] than model upgrades proper

Not clear. The folks at hf showed that a minimal "agentic loop" in 100 LoC [1] that gives the agent "just bash access" still got very close to SotA with all the bells and whistles (and surpassed last year models w/ handcrafted harnesses).

[1] - https://github.com/SWE-agent/mini-swe-agent

Small focused (local) model + tooling is the future, not online LLMs with monthly costs. Your coding model doesn't need all of the information in the world built in, it needs to know code and have tools available to get any information it needs to complete its tasks. We have treesitter, MCPs, LSPs, etc - use them.

The problem is that all the billions (trillions?) of VC money go to the online models because they're printing money at this point.

There's no money to be made in creating models people can run locally for free.

I mean, that's still proving the point that tooling matters. I don't think his point was "MCP as a technology is extraordinary" because it's not.
MCP is a marketing ploy, not an “invention”.
It is an actual invention that has concrete function, whether or not it was part of a marketing push.
I didn't realize generating the gen-z programming language was a goalpost in the first place
The question in your last paragraph is not the only one that matters. Funding the technology at a material loss will not be off the table. Think about why.
Just tell us why you think funding at a loss at this scale is viable, don’t smugly assign homework
Apologies, not meant to be smug
...But you did fully intend to assign homework? Why are you even commenting, what are you adding?
I have had LLMs write entire codebases for me, so it's not like the hype is completely wrong. It's just that this only works if what you want is "boring", limited in scope and on a well-trodden path. You can have an LLM create a CRUD application in one go, or if you want to sort training data for image recognition you can have it generte a one-off image viewer with shortcuts tailored to your needs for this task. Those are powerful things and worthy of some hype. For anything more complex you very quickly run into limits and the time and effort to do it with an LLM quickly approaches the time and effort required to do it by hand.
They're powerful, but my feeling is that largely you could do this pre-LLM by searching on Stack Overflow or copying and pasting from the browser and adapting those examples, if you knew what you were looking for. Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

Of course, if you don't know what you are looking for, it can make that process much easier. I think this is why people at the junior end find it is making them (a claimed) 10x more productive. But people who have been around for a long time are more skeptical.

> Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

To be fair, this is super, super helpful.

I do find LLMs helpful for search and providing a bunch of different approaches for a new problem/area though. Like, nothing that couldn't be done before but a definite time saver.

Finally, they are pretty good at debugging, they've helped me think through a bunch of problems (this is mostly an extension of my point above).

Hilariously enough, they are really poor at building MCP like stuff, as this is too new for them to have many examples in the training data. Makes total sense, but still endlessly amusing to me.

Why bother searching yourself? This is pre-LLM: https://github.com/drathier/stack-overflow-import
> Of course, if you don't know what you are looking for, it can make that process much easier.

Yes. My experience is that LLMs are really, really good at understanding what you are trying to say and bringing up the relevant basic information. That's a task we call "search", but it is different from the focused search people do most of the time.

Anyway, by the nature of the problem, that's something that people should do only a few times for each subject. There is not a huge market opportunity there.

Doing it the old fashioned lazy way, copy-pasting snippets of code you search for on the internet and slightly modifying each one to fit with the rest of your code, would take me hours to achieve the kind of slop that claude code can one shot in five minutes.

Yeah yeah, call me junior or whatever, I have thick skin. I'm a lazy bastard and I no longer care about the art of the craft, I just want programs tailored to my tastes and agentic coding tools are by far the fastest way to get it. 10x doesn't even come close, it's more like 100x just on the basis of time alone. Effort? After the planning stage I kick back with video games while the tool works. Far better than 100x for effort.

i have seen so many people say that, but the app stores/package managers aren't being flooded with thousands of vibe coded apps, meanwhile facebook is basically ai slop. can you share your github? or a gist of some of these "codebases"
You seem critical of people posting AI slop on Facebook (so am I) but also want people to publish more AI slop software?

The AI slop software I've been making with Claude is intended for my own personal use. I haven't read most of the code and certainly wouldn't want to publish it under my own name. But it does work, it scratches my itches, fills my needs. I'm not going to publish the whole thing because that's a whole can of worms, but to hopefully satisfy your curiosity, here is the main_window.py of my tag-based file manager. It's essentially a CRUD application built with sqlite and pyside6. It doesn't do anything terribly adventurous, the most exciting it gets is keeping track of tag co-occurances so it can use naive Bayesian classifiers to recommend tags for files, order files by how likely they are to have a tag, etc.

Please enjoy. I haven't actually read this myself, only verified the behavior: https://paste.debian.net/hidden/c6a85fac

> "the app stores/package managers aren't being flooded with thousands of vibe coded apps"

The state of claude code presently is definitely good enough to churn out low effort shovelware. Insofar as that isn't evidently happening, I can only speculate about the reasons. In no order, it may be one or several of these reasons: Lots of developers feel threatened by the technology and won't give it a serious whirl. Non-developers are still stuck in the mindset of writing software being something they can't do. The general public isn't as aware of the existence of agentic coding tools as we on HN are. The appstores are being flooded with slop, as they always have been, and some of that slop is now AI slop, but doesn't advertise this fact, and the appstore algorithms generally do some work to suppress the visibility of slop anyway. Most people don't have good ideas for new software and don't have the reflex to develop new software to scratch their itches, instead they are stuck in the mentality of software consumers. Just some ideas..

It’s hardly slop when you have over a 100 different sources referenced in a targeted paper.
> Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Which makes sense, considering the absolutely massive amount of tutorials and basic HOWTOs that were present in the training data, as they are the easiest kind of programming content to produce.

The purpose of an LLM is not to do your job, it's to do enough to convince your boss to sack you and pay the LLM company some portion of your salary.

To that end, it doesn't matter if it works or not, it just has to demo well.

> Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Yes, kind of. What you downplay as "extremely well-known opinionated patterns" actually means standard design patterns that are well established and tried-and-true. You know, what competent engineers do.

There's even a basic technique which consists of prompting agents to refactor code to clean it up to comply with best practices, as this helps agents evaluate your project as it lines them up with known patterns.

> What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs).

Yes, they are. It helps if a project is well structured, clean, and follow best practices. Messy projects that are inconsistent and evolve as big balls of mud can and do judge LLMs to output garbage based on the garbage that was inputted. Once, while working on a particularly bad project, I noticed GPT4.1 wasn't even managing to put together consistent variable names for domain models.

> But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.

This really depends on what are your expectations. A glass half full perspective clearly points you to the fact that yes agents can and do create entire codebases. I know this to be a fact because I did it already just for shits and giggles. A glass half empty perspective however will lead people to nitpick their way into asserting agents are useless at creating code because they once prompted something to create a Twitter code and it failed to set the right shade of blue. YMMV and what you get out is proportional to the effort you put in.

What is novel code?

  1. LLM's would suck at coming up with new algorithms. 
  2. I wouldn't let an LLM decide how to structure my code. Interfaces, module boundaries etc
Other than that, given the right context (the sdk doc for a unique hardware for eg) and a well organised codebase explained using CLAUDE.Md they work pretty well in filling out implementations. Just need to resist the temptation to prompt while the actual typing would take seconds.
Yep, LLMs are basically at the "really smart intern" level. Give them anything complex or that requires experience and they crash and burn. Give them a small, well-specified task with limited scope and they do reasonably well. And like an intern they require constant check-ins to make sure they're on track.

Of course with real interns you end up at the end with trained developers ready for more complicated tasks. This is useful because interns aren't really that productive if you consider the amount of time they take from experienced developers, so the main benefit is producing skilled employees. But LLMs will always be interns, since they don't grow with the experience.