Hacker News
by 112233 185 days ago
I have yet to find the niche where it is "good at the beginning". So far I've mostly tried asking it to build C tools that use advanced Linux APIs.

Me: hey make this, detailed-spec.txt

AI: okidoki (barfs 9k lines in 15 minutes) all done and tested!

Me looks at the code, that has feature-sounding names, but all features are stubs, all tests are stubs, and it does not compile.

Me: it does not compile.

AI: Yes, but the code is correct. Now that the project is done, which of these features you want me to add (some crazy list)

Me: Please get it to compile.

AI: You are absolutely right! This is an excellent idea! (proceeds to stub and delete most of what it barfed). I feel really satisfied with the progress! It was a real challenge! The code you gave me was very poorly written!

... and so on.

6 comments

I'm not sure what you're using. I've used Claude in agent mode to port a very complex, spaghetti-coded C application to nicely structured C++. The original code was so intertwined that I didn't want to figure it out, so I had shelved the project until AI came along.

It wasn't super bad at converting the code, but even it struggled with some of the logic. Luckily, I had it design a test suite to compare the outputs of the old application and the new one. When it couldn't figure out why it was getting different results, it would start generating hex dump comparisons, writing small Python programs, and analyzing the results to figure out where it had gone wrong. It slowly iterated on each difference until it had resolved them: building the code, running the test suite, comparing the results, changing the code, repeat. Some of the issues are likely bugs in the original code (that it fixed), but since I was going for byte-for-byte perfection it had to re-introduce them.
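As a rough illustration of the kind of byte-for-byte check described above, here is a minimal Python sketch. It is not the commenter's actual test suite; the function names (`first_difference`, `hex_context`, `compare_outputs`) are made up for this example, and it assumes the old and new outputs have already been read into memory as `bytes`.

```python
from typing import Optional


def first_difference(old: bytes, new: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return offset
    # One output is a prefix of the other: they differ where the shorter ends.
    if len(old) != len(new):
        return min(len(old), len(new))
    return None


def hex_context(data: bytes, offset: int, width: int = 8) -> str:
    """Hex-dump up to `width` bytes on either side of `offset`."""
    start = max(0, offset - width)
    return data[start:offset + width].hex(" ")


def compare_outputs(old: bytes, new: bytes) -> str:
    """Summarize where two outputs diverge, with a hex dump of each side."""
    offset = first_difference(old, new)
    if offset is None:
        return "identical"
    return (
        f"first difference at offset {offset:#x}\n"
        f"  old: {hex_context(old, offset)}\n"
        f"  new: {hex_context(new, offset)}"
    )
```

Reporting only the first divergence keeps each iteration small: fix one difference, rebuild, rerun, and look at the next one.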

The issues you describe I have seen but not with the right technology and not in a while.

At a high level, you asked the LLM to translate N lines of code into maybe 2N lines of code, while GP asked the LLM to translate N lines of English into possibly 10N lines of code. Very different scenarios.

The OP said the LLM didn't build anything, said it was great, and didn't even compile it. My experience has been quite the opposite: it not only compiles the code and fixes compile-time errors but also runs it and fixes runtime issues. It even went so far as to write waveform analysis tools in Python (the output of this project was WAV files) to determine the issues.

It doesn't really matter what we told it to do; a task is a task. But clearly each LLM performed that task very differently for me than for the OP.

LLMs are non-deterministic for everyone. Give it time.
I'll be the first to say I've abandoned a chat and started a new one to get the result I want. I don't see that as a net negative though -- that's just how you use it.
Are you sure claude didn't do exactly the same thing but the harness, claude code, just hid it from you?

I have seen AI agents fall into the exact loop that GP discussed and needed manual intervention to fall out of.

Also blindly having the AI migrate code from "spaghetti C" to "structured C++" sounds more like a recipe for "spaghetti C" to "fettuccine C++".

Sometimes it's hidden data structures and algorithms that you want to formalize when doing a large-scale refactor. I have found that AIs are definitely able to identify those, but it's not their default behaviour, and they fall out of it pretty quickly if not constantly reminded.

> Are you sure claude didn't do exactly the same thing but the harness, claude code, just hid it from you?

What do you mean? Are you under the impression I'm not even reading the code? The code is actually the most important part because I already have working software but what I want is working software that I can understand and work with better (and so far, the results have been good).

Reading the code and actually understanding the code are not the same thing.

"This looks good", vs "Oh that is what this complex algorithm was" is a big difference.

Effectively, to verify that the code is not just the same code rewritten with C++ syntax and conventions, you need to understand the original C code. That means the hard part was not the code generation (via LLM or fingers) but the understanding, and I'm unsure the AI can do the high level understanding since I have never gotten it to produce said understanding without explicitly telling it.

Effectively, "x.c, y.c, z.c implement a DSL but are convoluted and not well structured, generate the same DSL in C++" works great. "Rewrite x.c, y.c, z.c into C++ building abstractions to make it more ergonomic" generally won't recognise the DSL and formalise it in a way that is very easy to do in C++, it will just make it "C++" but the same convoluted structure exists.

> Reading the code and actually understanding the code are not the same thing.

Ok. Let me be more specific then. I'm "understanding" the code since that's the point.

> I'm unsure the AI can do the high level understanding since I have never gotten it to produce said understanding without explicitly telling it.

My experience has been the opposite: it often starts by producing a usable high-level description of what the code is doing (sometimes imperfectly) and then proposes refactors that match common patterns -- especially if you give it enough context and let it iterate.

> "Rewrite x.c, y.c, z.c into C++ building abstractions to make it more ergonomic" generally won't recognise the DSL and formalise it in a way that is very easy to do in C++, it will just make it "C++" but the same convoluted structure exists.

That can happen if you ask for a mechanical translation or if the prompt doesn't encourage redesign. My point was literally "make it well-designed, idiomatic C++", and it did that. The LLM's training data includes a whole bunch of C++ code, and it seems to be leaning on that.

I did direct some goals (e.g., separating device-specific code and configuration into separate classes so adding a device means adding a class instead of sprinkling if statements everywhere). But it also made independent structural improvements: it split out data generation vs file generation into pipeline/stream-like components and did strict separation of dependencies. It's actually well designed for unit testing and mocking even though I didn't tell it I wanted that.
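To make the "adding a device means adding a class" idea concrete, here is a small sketch of that structure. It is written in Python purely for illustration (the port discussed here is C++), and every name in it (`Device`, `DeviceA`, `DeviceB`, `DEVICES`, `generate_data`, `write_file`) and both header formats are invented for this example, not taken from the actual project.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class Device(ABC):
    """Holds everything that differs between devices."""

    @abstractmethod
    def header(self) -> bytes:
        """Return the device-specific file header."""


class DeviceA(Device):
    def header(self) -> bytes:
        return b"DEVA"


class DeviceB(Device):
    def header(self) -> bytes:
        return b"DEVB\x00"


# Adding a device means adding a class and one registry entry,
# rather than another branch in every function.
DEVICES: Dict[str, Type[Device]] = {"a": DeviceA, "b": DeviceB}


def generate_data(n: int) -> bytes:
    """Device-independent data generation stage."""
    return bytes(i % 256 for i in range(n))


def write_file(device_name: str, payload: bytes) -> bytes:
    """Device-independent file generation stage; no per-device if statements."""
    device = DEVICES[device_name]()
    return device.header() + payload
```

Because the data stage and the file stage only meet through `bytes`, each can be tested on its own, which is roughly the unit-testing benefit mentioned above.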

I'm not claiming it has human-level understanding or that it never makes mistakes -- but "it can't do high-level understanding" doesn't match what I'm seeing in practice. At minimum, it can infer the shape of the application well enough to propose and implement a much more ergonomic architecture, especially with iterative guidance.

I had to have it introduce some "bugs" for byte-for-byte matching because it had generalized some of the file generation and the original C code generated slightly different file structures for different devices. There's no reason for this difference; it's just different code trying to do the same thing. I'll probably remove these differences when the whole thing is done.

That clarifies a lot.

So effectively it was at least partly guided refactoring. Not blind vibe coding.

Sounds like the debug mode that Cursor just announced.
> I've used Claude in agent mode to port a very complex and spaghetti coded C application to nicely structured C++

You migrated code from one of the simplest programming languages to unarguably the most complex programming language in existence. I feel for you; I really do.

How did you ensure that it didn't introduce any of the myriad of footguns that C++ has that aren't present in C?

I mean, we're talking about a language here that has an entire book just for variable initialisation - choose the wrong one for your use-case and you're boned! Just on variable initialisation, how do you know it used the correct form in all of the places?

I do a lot of C++ programming and that's really overselling the issues. You don't have to read an entire book on variable initialization to do it correctly. And using STL types is a lot safer than passing pointers around.

It's actually far easier for me to tell that it's not leaking memory or accessing some unallocated data in the C++ version than in the C version.

A simple language just pushes complexity from the language into the code. Being able to represent things in a more high-level way is entirely the point of this exercise because the C version didn't have the tools to express it more cleanly.

In my case that was Claude Code with Opus.
I don't ever look at LLM-generated code that either doesn't compile or doesn't pass existing tests. IMHO any proper setup should involve these checks, with the LLM either fixing itself or giving up.

If you have a CLI, you can even script this yourself, if you don't trust your tool to actually try to compile and run tests on its own.

It's a bit like a PR on github from someone I do not know: I'm not going to actually look at it until it passes the CI.
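A hedged sketch of what "script this yourself" could look like: run the build, then the tests, and only look at the code if both pass. The function name `run_gate` and the `make` / `make test` commands are placeholders for whatever the project actually uses; only `subprocess.run` from the Python standard library is assumed.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_gate(steps: Sequence[Sequence[str]]) -> Tuple[bool, str]:
    """Run each command in order and stop at the first failure.

    Returns (passed, log). The log is the captured output of the failing
    step, and is empty if every step passed.
    """
    for cmd in steps:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError as exc:
            # The command itself is missing; treat that as a failed step.
            return False, str(exc)
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""


def main(argv: List[str]) -> int:
    # Hypothetical build and test commands; substitute the project's own.
    passed, log = run_gate([["make"], ["make", "test"]])
    if not passed:
        # This log is what you would hand back to the LLM for another attempt.
        print(log, file=sys.stderr)
    return 0 if passed else 1
```

The same exit status works as a CI check, which is the point of the PR analogy: nothing gets human attention until the gate returns success.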

> I have yet to find the niche where it is "good at the beginning".

The niche is "the same boring CRUD web app someone made in 2003 but with Tailwind CSS".

Good work, if you can get it.
Holy shit, I feel the same. I was arguing with an LLM one day about how to do Kerberos auth on incoming HTTP requests. It kept giving me bogus advice that I could disprove with a tiny snip of code. I would explain. It would react just like yours. After a few rounds, it would give the first answer again. Awful. So infuriating.

I had a similar issue with gnuplot. The LLM-suggested scripts frequently had syntax errors. I say: LLMs are awesome when they work; otherwise they are a time suck / net negative.

Sometimes they just get into "argument simulator mode". There's a lot of training data of people online having stupid arguments.
You can write any program you want, as long as it is flappy bird in reactjs.
heh
Willing to name “an LLM”?

Was this a local model?

Good question. It was not my intent to be evasive about the LLM. I should have included it in my original post. I tried the free versions of both OpenAI ChatGPT and Google Gemini. To be clear, when I say "free", I mean just go to the website and start chatting with the bot.
Include a verifiable, testable exit criterion (such as compiling) in the prompt and use an agentic tool like Cursor or Codex with it; you’d be surprised what happens :)
Is Claude Code with both Sonnet and Opus agentic enough? Because it is constantly finding creative ways to ignore direct, repeated instructions ("user asked X but it is hard, let's do Y instead"), implement fake tests ("feature X is complex. we need to test it completely. let's write script that will create files that feature X would have created, then test that files exist"), and sabotage and delete working code ("we need to track FD of the open file (runs strace). The FD is 5 (hardcodes 5 in the code instead of implementing anything useful) tests pass now!").

I have not experienced this level of malice and sweet-talking work avoidance from anyone. It apologizes like an alcoholic, then proceeds to double down.

Can you force it to produce actually useful code? Yes, by repeatedly yelling at it to please follow the instructions. In the process, it will break, delete, or introduce hard-to-find bugs in the rest of the codebase.

I'm really curious whether anyone actually has this thing working, or whether they simply haven't bothered to read the generated code.

You need to use the features that Claude Code gives you in order to be successful with it. Your build and tests should be in a Stop hook that prevents Claude from stopping if the build or tests fail. Combining this with a second Stop hook that bails out if the first hook has already failed n times prevents infinite loops.
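The decision logic behind that setup can be sketched in a few lines. This is only the "block until checks pass, but give up after n failures" rule, folded into one function instead of two hooks. How a hook actually reports its decision to Claude Code (exit status, output format, configuration) is deliberately left out and should be taken from the hooks documentation. The names `decide_stop`, `state_file`, and `max_failures` are hypothetical.

```python
import json
from pathlib import Path


def decide_stop(checks_passed: bool, state_file: Path, max_failures: int = 3) -> str:
    """Return "allow" or "block" for an agent that wants to stop.

    Blocks while the build/tests fail, but allows the stop after
    `max_failures` consecutive failures so the agent cannot loop forever.
    The failure count is persisted in `state_file` between invocations.
    """
    failures = 0
    if state_file.exists():
        failures = json.loads(state_file.read_text()).get("failures", 0)

    if checks_passed:
        state_file.write_text(json.dumps({"failures": 0}))
        return "allow"

    failures += 1
    if failures >= max_failures:
        # Bail out: reset the counter and let the agent stop for human review.
        state_file.write_text(json.dumps({"failures": 0}))
        return "allow"

    state_file.write_text(json.dumps({"failures": failures}))
    return "block"
```

Persisting the counter in a file is one simple choice, assuming each hook invocation is a fresh process with no shared memory.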

With anything above a toy project, you need to be really good with context window management. Usually this means using subagents and scoping prompts correctly by placing the CLAUDE.md files next to the relevant code. Your main conversation's context window usage should pretty much never be above 50%. Use the /clear command between unrelated tasks. Consider if recurring sequences of tool calls could be unified into a single skill.

Instead of sending instructions to the agent straight away, try planning with it and prompting it to ask you questions about your plan. The planning phase is a good place to give Claude more space to think with "think > think hard > ultrathink". If you are still struggling with the agent not complying, try adding emphasis with "YOU MUST" or "IMPORTANT".

As I'm getting better and better results with it, I'm having it do more and more things. I went through a complete agentic refactor of a project from Angular 17 to Angular 20 (RxJS to Signals) and I'd say it did it perfectly. A few times I'd have it summarize and start a new chat, because it can get less effective when the history gets too long. I also had to iterate on what I wanted and do things a step at a time, although it was very clear that it also wanted to do things in pieces and test each major change before continuing.

I think, like any tool, it has its pros and cons, and the more you use it the more you figure out how to make the best use of it and when to give up.

It's terrible at the niches I actually have expertise in, which are in mathematics. I'd guess an expert is going to find the flaws in anything it's doing in their field. That being said, if you're just trying to e.g. see what some GUI library can do then it's pretty useful to get something going. I personally would prefer not using it in anything that's not very much a throwaway test project though, but that is my luxury as a jobless bum.
But doesn't your argument actually mean it is terrible at absolutely everything in a very subtle, convincing way, so that it takes an actual expert in the field to tell that the generated text is not a profound revelation but a bag of nonsense?

Meaning, is the answer in a field I'm not an expert in actually good, or am I simply being fooled by emoji and nice grammar?

I don't think it's an expert; I just don't think expertise is necessary to get some value out of it if you aren't an expert yourself. The trap is letting the charade go on longer than it should, though. I personally only see the main value in using it to create test projects or to get the gist of what a library can do. I do think that's pretty valuable, and I also think real expertise is more valuable.

Or you can do like some of the others suggest and eliminate pure vibecoding. Just use it as a back and forth where you understand along the way and make well-reasoned changes. That looks a lot more like real engineering, so it's not surprising the other commenters report better results.

Gell-Mann amnesia, but for LLMs.
It's an interesting concept, but inapplicable here because I don't trust the media reporting on LLMs and I personally believe expert programmers are never going to be replaced. My concept of the value of LLMs is that they are good for generating throwaway test code to assess the use of a library or to prototype a feature.