Hacker News
by _pdp_ 249 days ago
This is exactly the direction I am seeing agents go. They should be able to write their own tools and we are soon launching something about that.

That being said...

LLMs are amazing for some coding tasks and fail miserably at others. My hypothesis is that, given the current model architectures, there is some sort of practical limit to how many concepts an LLM can take into account, no matter the size of the context window.

For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance.

I wrote more about this here if you are interested: https://chatbotkit.com/reflections/where-ai-coding-agents-go...

3 comments

> For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance.

Plan for solving this problem:

- Build a comprehensive design system with AI models

- Catalogue the components it fails on (like yours)

- These components are the perfect test cases for hiring challenges (immune to “cheating” with AI)

- The answers to these hiring challenges can be used as training data for models

- Newer models can now solve these problems

- You can vary this by framework (web component / React / Vue / Svelte / etc.) or by version (React v18 vs React v19, etc.)

What you’re doing with this is finding the exact contours of the edge of AI capability, then building a focused training dataset to push past those boundaries. Also a Rosetta Stone for translating between different frameworks.

I put a brain dump about the bigger picture this fits into here:

https://jim.dabell.name/articles/2025/08/08/autonomous-softw...

also training data quality. they are horrifyingly bad at concurrent code in general in my experience, and looking at most concurrent code in existence.... yeah I can see why.
The really depressing part about LLMs (and the limitations of ML more generally) is that humans are really bad at formal logic (which is what programming basically is), and instead of continuing down the path of making machines that make it harder for us to get it wrong, we decided to toss every open piece of code/text in existence into a big machine that then reproduces those patterns non-deterministically, and to use that to build more programs.

One can see the results in a place where most code is terrible (data science is the place I see this most, as it's what I do mostly) but most people don't realise this. I assume this also happens for stuff like frontend, where I don't see the badness because I'm not an expert.

> is that humans are really bad at formal logic (which is what programming basically is),

The tricky part is that I don't think all programming is formal logic at all, just a small part. And the fact that different code serves different purposes really screws up an LLM's reasoning process unless you make it really clear what code is for what.

> The tricky part is that I don't think all programming is formal logic at all, just a small part.

Why do you say this? The foundation of all of computer science is formal logic and symbolic logic.

Lots of parts are more creative or more "for humans" I might say, like building the right abstractions considering the current context and potentially future contexts. There are no "right/wrong" abstractions, just abstractions with different tradeoffs, and lots of things in programming are like this: not a binary "this is correct, this is wrong", but somewhere along a spectrum of "This is what I subjectively prefer considering these tradeoffs".

There is a reason a lot of programmers see programming as having lots of similarities with painting and other creative activities.

Any programmer who doesn't understand the basis of their craft and the environment they're working in isn't a very good one imo.
> Why do you say this? The foundation of all of computer science is formal logic and symbolic logic.

Yes, but it also has to deal with "the real world", which is only logical if you can encode a near-infinite number of variables. Instead, we create leaky abstractions in order to actually get work done.

And those abstractions need to be encoded using symbolic and formal logic.
We basically throw rigor out the window and hope it doesn't hit anybody on the way down.
Or when code is fully vectorizable they default to using loops even if explicitly told not to use loops. Code I got an LLM to write for a fairly straightforward problem took 18 minutes to run.

my own solution? 1.56 seconds. I consider myself to be at an intermediate skill level, and while LLMs are useful, they likely won't replace any but the least talented programmers. Even then I'd value a human with critical thinking paired with an LLM over an even more competent LLM.

Codex (GPT-5) + Rust (with or without Tokio) seems to work out well for me, asking it to run the program and validate everything as it iterates on a solution. I've used the same workflow with Python programs too and seems to work OK, but not as well as with Rust.

Just for curiosity's sake, what language have you been trying to use?

mostly Go, because that's at work. for a variety of reasons, I have helped troubleshoot at least 100+ teams' projects, many of which have had concurrency issues either obvious in nearby code, or causing issues (which is why I was helping troubleshoot). same with several dozen "help us find a way to speed up [this activity]" teams' work.

this is not at all a sample of high-quality, well-educated-about-concurrency code, but it does roughly match a lot of Business™ code and also most less-mature open source code I encounter (which is most open source code). it's just not something most people are fluent with.

these same people using LLMs have generally produced much worse concurrent code, regardless of the model or their prompting-sophistication or thinking-time, unless it's extremely trivial (then it's slightly better, because it's at least tutorial-level correct) (and yes, they should have just used one of many pre-existing libraries in these cases). doing anything except like "5 workers on this queue plz" consistently ends up with major correctness flaws - often it works well enough while everything is running smoothly, but under pressure or in error cases it falls apart extremely badly... which is true for most "how to write x concurrently" blog posts I run across too - they're over-simplified to the point of being unusable in practice (e.g. by ignoring error handling) and far too inflexible to safely change for slightly different needs.
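
for illustration only, here is a minimal sketch of what even the "5 workers on this queue" case looks like once error handling and clean shutdown are included. it is written in Python with just the standard library rather than Go, and `run_pool`, `flaky` and `_STOP` are hypothetical names made up for this example, not taken from any of the projects mentioned above. it assumes jobs are independent and that a failed job should be recorded rather than stop the pool.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel telling a worker to exit


def run_pool(jobs, handler, num_workers=5):
    """Run handler(job) for every job on num_workers threads.

    Returns (results, errors). A failing job is recorded in errors
    instead of killing its worker or hanging the pool.
    """
    todo = queue.Queue()
    results, errors = [], []
    lock = threading.Lock()  # guards results and errors

    def worker():
        while True:
            job = todo.get()
            if job is _STOP:
                return  # one sentinel per worker, so every worker exits
            try:
                value = handler(job)
                with lock:
                    results.append((job, value))
            except Exception as exc:
                # the step tutorials tend to skip: keep the worker alive
                # and report the failure to the caller
                with lock:
                    errors.append((job, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        todo.put(job)
    # the queue is FIFO, so sentinels are only seen after all real jobs
    for _ in threads:
        todo.put(_STOP)
    for t in threads:
        t.join()
    return results, errors


def flaky(n):
    """Hypothetical handler that fails on every fourth job."""
    if n % 4 == 0:
        raise ValueError(f"bad job {n}")
    return n * n


if __name__ == "__main__":
    done, failed = run_pool(range(1, 11), flaky)
    print(sorted(done))
    print(sorted(job for job, _ in failed))
```

even this leaves out retries, timeouts, cancellation and backpressure, which is roughly where the "works until something goes wrong" failures tend to show up.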

honestly I think it's mostly due to two things: a lack of quality training material (some obviously exists, but it's overwhelmed by flawed stuff), and an extreme sensitivity to subtle flaws (much more so than normal code). so it's both bad at generalizing (not enough transitional examples between targets), and its general inability to actually think introduces flaws that look like normal code, which humans are less likely to notice (due to their own lack of experience).

this is not to claim it's not possible to use them to write good concurrent code, there are many counter-examples that show it is. but it's a particularly error-prone area in practice, especially in languages without much safer patterns or built-in verification.

In my experience, because the Clojure concurrency model is just incredibly sane and easy to get right, LLMs have no difficulty with it.
With the upcoming release of Gemini 3.0 Pro, we might see a breakthrough for that particular issue. (Those are the rumors, at least.) I'm sure not fully solved, but possibly greatly improved.