Hacker News new | ask | show | jobs
by all_factz 170 days ago
React is hundreds of thousands of lines of code (or millions - I haven’t looked in awhile). Sure, you can start by having the LLM create a simple way to sync state across components, but in a serious project you’re going to run into edge-cases that cause the complexity of your LLM-built library to keep growing. There may come a point at which the complexity grows to such a point that the LLM itself can’t maintain the library effectively. I think the same rough argument applies to MomentJS.
3 comments

If the complexity grows beyond what it makes sense to do without React I'll have the LLM rewrite it all in React!

I did that with an HTML generation project to switch from Python strings to Jinja templates just the other day: https://github.com/simonw/claude-code-transcripts/pull/2

Simon, you're starting to sound super disconnected from reality, this "I hit everything that looks like a nail with my LLM hammer" vibe is new.
My habits have changed quite a bit with Opus 4.5 in the past month. I need to write about it..
What's concerning to many of us is that you've (and others) have said this same thing s/Opus 4.5/some other model/

That feels more like chasing than a clear line of improvement. It's interrupted very different from something like "my habits have changed quite a bit since reading The Art of Computer Programming". They're categorically different.

It's because the models keep getting better! What you could do with GPT-4 was more impressive than what you could do with GPT 3.5. What you could do with Sonnet 3.5 was more impressive yet, and Sonnet 4, and Sonnet 4.5.

Some of these improvements have been minor, some of them have been big enough to feel like step changes. Sonnet 3.7 + Claude Code (they came out at the same time) was a big step change; Opus 4.5 similarly feels like a big step change.

(If you don't trust vibes, METR's task completion benchmark shows huge improvements, too.)

If you're sincerely trying these models out with the intention of seeing if you can make them work for you, and doing all the things you should do in those cases, then even if you're getting negative results somehow, you need to keep trying, because there will come a point where the negative turns positive for you.

If you're someone who's been using them productively for a while now, you need to keep changing how you use them, because what used to work is no longer optimal.

Models keep getting better but the argument I'm critiquing stays the same.

So does the comment I critiqued in the sibling comment to yours. I don't know why it's so hard to believe we just haven't tried. I have a Claude subscription. I'm an ML researcher myself. Trust me, I do try.

But that last part also makes me keenly aware of their limitations and failures. Frankly I don't trust experts who aren't critiquing their field. Leave the selling points to the marketing team. The engineer and researcher's job is to be critical. To find problems. I mean how the hell do you solve problems if you're unable to identify them lol. Let the marketing team lead development direction instead? Sounds like a bad way to solve problems

  > benchmark shows huge improvements
Benchmarks are often difficult to interpret. It is really problematic that they got incorporated into marketing. If you don't understand what a benchmark measures, and more importantly, what it doesn't measure, then I promise you that you're misunderstanding what those numbers mean.

For METR I think they say a lot right here (emphasis my own) that reinforces my point

  > Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most *exam-style problems* for a fraction of the cost. ... And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. *They are unable to reliably handle even relatively low-skill*, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.
So make sure you're really careful to understand what is being measured. What improvement actually means. To understand the bounds.

It's great that they include longer tasks but also notice the biases and distribution in the human workers. This is important in properly evaluating.

Also remember what exactly I quoted. For a long time we've all known that being good at leetcode doesn't make one a good engineer. But it's an easy thing to test and the test correlates with other skills that are likely to be learned to be good at those tests (despite being able to metric hack). We're talking about massive compression machines. That pattern match. Pattern matching tends to get much more difficult as task time increases but this is not a necessary condition.

Treat every benchmark adversarialy. If you can't figure out how to metric hack it then you don't know what a benchmark is measuring (and just because you know what can hack it doesn't mean you understand it nor that that's what is being measured)

Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.

Why do you use the word "chasing" to describe this? I don't understand. Maybe you should try it and compare it to earlier models to see what people mean.

  > Why do you use the word "chasing" to describe this?
I think you'll get the answer to this if you read my comment and your response to understand why you didn't address mine.

Btw, I have tried it. It's annoying that people think the problem is not trying. It was getting old when GPT 3.5 came out. Let's update the argument...

Looking forward to hearing about how you're using Opus 4.5, from my experience and what I've heard from others, it's been able to overcome many obstacles that previous iterations stumbled on
Please do. I'm trying to help other devs in my company get more out of agentic coding, and I've noticed that not everyone is defaulting to Opus 4.5 or even Codex 5.2, and I'm not always able to give good examples to them for why they should. It would be great to have a blog post to point to…
Can you expound on Opus 4.5 a little? Is it so good that it's basically a superpower now? How does it differ from your previous LLM usage?
To repeat my other comment:

> Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.

Reality is we went from LLMs as chatbots editing a couple files per request with decent results. To running multiple coding agents in parallel to implement major features based on a spec document and some clarifying questions - in a year.

Even IF llms don't get any better there is a mountain of lemons left to squeeze in their current state.

That would go over on any decently sized team like a lead balloon.
As it should, normally, because "we'll rewrite it in React later" used to represent weeks if not months of massively disruptive work. I've seen migration projects like that push on for more than a year!

The new normal isn't like that. Rewrite an existing cleanly implemented Vanilla JavaScript project (with tests) in React the kind of rote task you can throw at a coding agent like Claude Code and come back the next morning and expect most (and occasionally all) of the work to be done.

I’m going to add my perspective here as they seem to all be ganging up on you Simon.

He is right. The game has changed. We can now refactor using an agent and have it done by morning. The cost of architectural mistakes is minimal and if it gets out of hand, you refactor and take a nap anyway.

What’s interesting is now it’s about intent. The prompts and specs you write, the documents you keep that outline your intended solution, and you let the agent go. You do research. Agent does code. I’ve seen this at scale.

And everyone else's work has to be completely put on hold or thrown away because you did the whole thing all at once on your own.

That's definitely not something that goes over well on anything other than an incredibly trivial project.

Why did you jump to the assumption that this:

> The new normal isn't like that. Rewrite an existing cleanly implemented Vanilla JavaScript project (with tests) in React the kind of rote task you can throw at a coding agent like Claude Code and come back the next morning and expect most (and occasionally all) of the work to be done.

... meant that person would do it in a clandestine fashion rather than this be an agreed upon task prior? Is this how you operate?

My very first sentence:

> And everyone else's work has to be completely put on hold

On a big enough team, getting everyone to a stopping point where they can wait for you to do your big bang refactor to the entire code base- even if it is only a day later- is still really disruptive.

The last time I went through something like this, we did it really carefully, migrating a page at a time from a multi page application to a SPA. Even that required ensuring that whichever page transitioned didn't have other people working on it, let alone the whole code base.

Again, I simply don't buy that you're going to be able to AI your way through such a radical transition on anything other than a trivial application with a small or tiny team.

> meant that person would do it in a clandestine fashion rather than this be an agreed upon task prior? Is this how you operate?

This doesn't mean this at all

In an AI heavy project it's not unusual to have many speculative refactors kicked off and then you come back to see what it is like.

Wonder you can do a Rust SIMD optimized version of that Numpy code you have? Try it! You don't even need to waste review time on it because you have heavy test coverage and can see if it is worth looking at.

If you have 100s of devs working on the project it’s not possible to do a full rewrite in one go. So its to about clandestine but rather that there’s just no way to get it done regardless of how much AI superpowers you bring to bear.
Let's say I'm mildly convinced by your argument. I've read your blog post that was popular on HN a week or so ago and I've made similar little toy programs with AI that scratch a particular niche.

Do you care to make any concrete predictions on when most developers will embrace this new normal as part of their day to day routine? One year? Five?

And how much of this is just another iteration in the wheel of recarnation[0]? Maybe we're looking at a future where we see return to the monoculture library dense supply chain that we use today but the libraries are made by swarms of AI agents instead and the programmer/user is responsible for guiding other AI agents to create business logic?

[0] https://www.computerhope.com/jargon/w/wor.htm

It's really hard to predict how other developers are going to work, especially given how resistant a lot of developers are to fully exploring the new tools.

I do think there's been a bit of a shift in the last two months, with GPT 5.1 and 5.2 Codex and Opus 4.5.

We have models that can reliably follow complex instructions over multiple hour projects now - that's completely new. Those of us at the cutting edge are still coming to terms with the consequences of this (as illustrated by this Karpathy tweet).

I don't trust my predictions myself, but I think the next few months are going to see some big changes in terms of what mainstream developers understand these tools as being capable of.

"The future is already here, it's just unevenly distributed."

At some companies, most developers already are using it in their day to day. IME, the more senior the developer is, the more likely they are to be heavily using LLMs to write all/most of their code these days. Talking to friends and former coworkers at startups and Big Tech (and my own coworkers, and of course my own experience), this isn't a "someday" thing.

People who work at more conservative companies, the kind that don't already have enterprise Cursor/Anthropic/OpenAI agreements, and are maybe still cautiously evaluating Copilot... maybe not so much.

"React is hundreds of thousands of lines of code".

Most of which are irrelevant to my project. It's easier to maintain a few hundred lines of self written code than to carry the react-kitchen-sink around for all eternity.

Not all UIs converge to a React like requirement. For a lot of use cases React is over-engineering but the profession just lacks the balls to use something simpler, like htmx for example.
Sure, and for those cases I’d rather tell the agent to use htmx instead of something hand-rolled.
Core react is fairly simple, I would have no problem using it for almost everything. The overengineering usually comes at a layer on top.