by krastanov 87 days ago
This is fascinating to me. I completely believe you, and I will not bother you with all the common "but did you try to tell it this or that" responses, but this is such a different experience from mine. I did the exact same task with Claude in the Julia language last week, and everything worked perfectly. I am now in the habit of adding "keep it simple, use only public interfaces, do not use internals, be elegant and extremely minimal in your changes" to all my requests or SKILL.md or AGENTS.md files (because of the occasional failure like the one you described). But generally speaking, such complete failures have been so rare for me that it is amazing to see that others have had such a completely different experience.
4 comments

My experience working with AI agents is that, unless directed explicitly, they are verbose and overcomplicate things by default, which may be the reason for this discrepancy.
You say it doesn't fail, but you also mention all these workarounds you know and try... sounds like it fails a lot, but your tolerance is different.
Most people I've seen complain say things like "I asked it for code and it didn't compile."

The real magic of LLMs comes when they iterate until completion until the code compiles and the test passes, and you don't even bother looking at it until then.

Each step is pretty stupid, but the ability to keep at it quickly and doggedly until success quite often produces great work.

If you don't have linters that check for valid syntax and approved coding style, if you don't have tests to ensure the LLM doesn't screw up the code, if you don't have good CI, you're going to have a bad time.

LLMs are just like extremely bright but sloppy junior devs - if you think about putting the same guardrails in place for your project that you would for that case, things tend to work very well - you're giving the LLM a chance to check its work and self-correct.

It's the agentic loop that makes it work, not the single-shot output of an LLM.
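
For what it's worth, here is a minimal Python sketch of that loop. Everything in it is made up for illustration: `iterate_until_green`, `compiles`, `passes_tests` and `scripted_model` are hypothetical names, and the "model" is just a scripted list of drafts standing in for a real LLM call. The only point is the shape: run the checks, feed the failures back, repeat until everything is green or the budget runs out.

```python
from typing import Callable, List, Optional, Tuple

# A check takes candidate source code and returns (passed, feedback).
Check = Callable[[str], Tuple[bool, str]]


def iterate_until_green(
    propose: Callable[[str, str], str],  # hypothetical stand-in for the LLM call
    checks: List[Check],
    source: str,
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Revise `source` until every check passes or the budget is spent.

    Returns (final_source, revisions), or (None, max_rounds) on failure.
    """
    for revisions in range(max_rounds + 1):
        failures = [msg for ok, msg in (check(source) for check in checks) if not ok]
        if not failures:
            return source, revisions
        if revisions == max_rounds:
            break
        # Feed the collected failure messages back into the next attempt.
        source = propose(source, "\n".join(failures))
    return None, max_rounds


def compiles(source: str) -> Tuple[bool, str]:
    """Cheapest gate: does the candidate even parse?"""
    try:
        compile(source, "<candidate>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg} (line {exc.lineno})"


def passes_tests(source: str) -> Tuple[bool, str]:
    """Toy test suite: the candidate must define a working add()."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5
        return True, ""
    except Exception as exc:
        return False, f"test failed: {exc!r}"


def scripted_model(drafts: List[str]) -> Callable[[str, str], str]:
    """Fake model that ignores feedback and replays canned drafts in order."""
    remaining = iter(drafts)
    return lambda source, feedback: next(remaining)


drafts = [
    "def add(a, b):\n    return a - b\n",  # parses, but fails the test
    "def add(a, b):\n    return a + b\n",  # parses and passes
]
broken = "def add(a, b) return a + b"  # does not even parse
final, revisions = iterate_until_green(
    scripted_model(drafts), [compiles, passes_tests], broken
)
print(revisions, final)
```

A single-shot evaluation would have stopped at `broken`; the loop only reports success once both gates pass.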

Stuff like this works for things that can be verified programmatically (though I find LLMs still do occasionally ignore instructions like this), but ensuring correct functionality and sensible code organization are bigger challenges.

There are techniques that can help deal with this, but none of them work perfectly, and most of the time some direct oversight from me is required. And this really clips the potential productivity gains, because in order to provide oversight effectively you need to page in all the context of what's going on and how it ought to work, which is most of what the LLMs are, in theory, helping you with.

LLMs are still very useful for certain tasks (bootstrapping in new unfamiliar domains, tedious plumbing or test fixture code), but the massive productivity gains people are claiming or alluding to still feel out of reach.

It depends - there are some very, very difficult things that can still be easily verified!

For instance, if you are working on a compiler and have a huge test database of code to compile that all has tests itself, "all sample code must compile and pass tests, ensuring your new optimizer code gets adequate branch coverage in the process" - the underlying task can be very difficult, but you have large amounts of test coverage that have a very good chance at catching errors there.

At the very least "LLM code compiles, and is formatted and documented according to lint rules" is pretty basic. If people are saying LLM code doesn't compile, then yes, you are using it very incorrectly, as you're not even beginning to engage the agentic loop at all, as compiling is the simplest step.

Sure, a lot of more complex cases require oversight or don't work.

But "the code didn't compile" is definitely in "you're holding it wrong" territory, and it's not even subtle.

Yeah performance optimization is potentially another good area for LLMs to shine, if you already have a sufficiently comprehensive test suite, because no functionality is changing. But if functionality is changing, you need to be in the loop to, at the very least, review the tests that the LLM outputs. Sometimes that's easier than reviewing the code itself, but other times I think it requires similar levels of context.

But honestly I think sane code organization is the bigger hurdle, and it is a lot harder to get right without manual oversight. Which of course leads to the temptation to give up on reviewing the code and just trust whatever the LLM outputs. But I'm skeptical this is a viable approach. LLMs, like human devs, seem to need reasonably well-organized code to be able to work in a codebase, but I think the code they output often falls short of this standard.

(But yes, agreed that getting the LLM to iterate until CI passes is table stakes.)

Strongly agreed!

I think getting good code organization out of an LLM is one of the subtler things - I've learned quite a bit about what sort of things need to be specified, realizing that the LLM isn't actively learning my preferences particularly well, so there are some things about code organization I just have to be explicit about.

Which is more work, but less work than just writing the code myself to begin with.

> The real magic of LLMs comes when they iterate until completion until the code compiles and the test passes, and you don't even bother looking at it until then.

If you read my post, you’d see that Claude Code didn’t do that. I had to intervene in the agent loop, and when I did, it undid my fixes.

This is not compatible with many people's experiences. I use Python with a type checker. I tell Claude that the task is only complete once the type checker passes cleanly. It doesn't stop until there are no type errors. This should be even easier in a compiled language, especially if you also tell it to run the tests.

In fact, I find that with a strict feedback loop set up (i.e. a lot of lint rules, a strict type checker and fast unit tests), it will almost always generate what I want.

As someone else said, each step might be pretty stupid, but if you have a fast iteration loop, it can run until everything passes cleanly. My recommendation is to specify what really counts as "done" in your AGENTS.md/CLAUDE.md.
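
To make "done" concrete, here is a rough Python sketch of a gate script an agent could be told to run before claiming completion. It is an assumption on my part, not anyone's actual setup: `is_done` and the gate names are hypothetical, and the demo uses trivial stand-in commands so it runs anywhere. You would swap in whatever type checker and test runner your project really uses.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def is_done(gates: Dict[str, List[str]]) -> Tuple[bool, Dict[str, str]]:
    """Run every gate command; the task is done only if all exit with 0.

    Returns (done, failures), where failures maps gate name to its output,
    so the output can be handed straight back to the agent.
    """
    failures: Dict[str, str] = {}
    for name, command in gates.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        if proc.returncode != 0:
            failures[name] = (proc.stdout + proc.stderr).strip()
    return not failures, failures


# Stand-in gates so the sketch is runnable without extra tools installed.
# In a real project these might be, for example (assumed, adjust to taste):
#   "typecheck": [sys.executable, "-m", "mypy", "."]
#   "tests":     [sys.executable, "-m", "pytest", "-q"]
demo_gates = {
    "typecheck": [sys.executable, "-c", "pass"],
    "tests": [sys.executable, "-c", "import sys; print('1 failed'); sys.exit(1)"],
}

done, failures = is_done(demo_gates)
print("done" if done else f"not done: {failures}")
```

The AGENTS.md/CLAUDE.md instruction then reduces to one sentence: the task is complete only when this script reports done.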

I tried again this morning (https://news.ycombinator.com/item?id=47487638): another hard failure that Claude Code says passed.
Providing instruction and context doesn’t seem like a “workaround”.
Agents are LLMs that use tools in a loop
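
A toy Python sketch of that definition, in case it helps. Nothing here is a real agent framework: `run_agent`, the step format and the scripted "model" are all invented for illustration. The model either asks for a tool call or gives a final answer, and each tool result is appended to the transcript it sees on the next turn.

```python
from typing import Callable, Dict, List, Tuple

# A step is ("call", tool_name, argument) or ("final", answer).
Step = Tuple[str, ...]


def run_agent(
    model: Callable[[List[str]], Step],  # hypothetical stand-in for an LLM
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 10,
) -> str:
    """Loop: ask the model for a step, run the requested tool, feed back the result."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        step = model(transcript)
        if step[0] == "final":
            return step[1]
        _, name, argument = step
        tool = tools.get(name)
        result = tool(argument) if tool else f"error: unknown tool {name!r}"
        transcript.append(f"{name}({argument!r}) -> {result}")
    raise RuntimeError("step budget exhausted")


def scripted_model(transcript: List[str]) -> Step:
    """Fake model: call one tool, then report whatever the tool returned."""
    if len(transcript) == 1:
        return ("call", "word_count", "the quick brown fox")
    return ("final", transcript[-1].split(" -> ")[1])


tools = {"word_count": lambda text: str(len(text.split()))}
print(run_agent(scripted_model, tools, "count the words"))
```

With an empty `tools` dict the model has nothing to check its guesses against, so all it can do is produce unverified text.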

I didn't give my Agent any tools, it hallucinated code.

...well, yes. It's like evaluating a programmer's skill by having them one-shot a program on paper with zero syntax errors.

IME it does fail pretty hard at first. One has to build up a little library of markdown and an internal library of prompt techniques. Then it starts working okay. I agree there is still a hurdle; trying it on one task doesn't really get one over the hurdle.
Idk. Had a friend recommend me gsdv2 and I wasted like $100 + so much time trying to debug said crap. I went back to codex and it one-shot my problems easily.

And this was from two people who were 100% aligned on agentic AI coding. I've been using AI for years now and agentic AI for several months now. I was told that I bring out the worst in LLMs. Except... I was able to achieve better results, WITH LLMs, on OTHER frameworks. So like, ?

It may be easier to draw the boundary between "AI and non-AI users", but as AI becomes pervasive, the us-vs-them angle that people keep using won't apply anymore.

The age-old "user versus tool" debate goes on, but it seems like gaslighting is popular these days. I classify it as gaslighting because I'm clearly a falsifiable test case, I'm even gung ho for LLMs, yet any kind of dissent is immediately warped into user error. It doesn't matter what you say or where you are on the spectrum: if you have one bad experience and speak up about it, it's an issue. Guess that's not really an issue for the human condition though.

It's almost like... LLMs are non-deterministic and hallucinating... Oh wait?!
Non-deterministic does not mean it is not biased towards a particular type of result (helpful results).
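
A tiny Python illustration of that distinction, as a sketch only. The 90/10 split is an invented number purely for the demo and says nothing about any real model; `sample_replies` is a hypothetical helper. Individual draws vary from run to run, yet the distribution they come from is heavily skewed.

```python
import random
from collections import Counter
from typing import List, Optional


def sample_replies(
    n: int, p_helpful: float = 0.9, seed: Optional[int] = None
) -> List[str]:
    """Draw n replies from a skewed two-outcome distribution.

    p_helpful is a made-up illustrative weight, not a measured quantity.
    """
    rng = random.Random(seed)
    return rng.choices(
        ["helpful", "unhelpful"], weights=[p_helpful, 1 - p_helpful], k=n
    )


# Unseeded: the exact sequence differs between runs (non-deterministic),
# but the counts still lean heavily towards "helpful" (biased).
print(Counter(sample_replies(1000)))
```

So both comments can be true at once: any single run can go wrong, and most runs still land on the helpful side.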