Hacker News new | ask | show | jobs
by v_CodeSentinal 138 days ago
This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.

6 comments

Yes, and better still the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure, decide if the code it wrote or the tests it wrote are wrong, snd while there is a chance of confirmation bias, it often works well enough
> decide if the code it wrote or the tests it wrote are wrong

Personally I think it's too early for this. Either you need to strictly control the code, or you need to strictly control the tests, if you let AI do both, it'll take shortcuts and misunderstandings will much easier propagate and solidify.

Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way of making the tests pass while not actually exercising the code it should exercise with the tests.

I haven’t found that to be the case in practice. There is a limit on how big the code can be so it can do it like this, and it still can’t reliably subdivide problems on its own (yet?), but give it a module that is small enough it can write the code and the tests for it.

You should never let the LLM look at code when writing tests, so you need to have it figure out the interface ahead of time. Ideally, you wouldn’t let it look at tests when it was writing code, but it needs to tell which one was wrong. I haven’t been able to add an investigator into my workflow yet, so I’m just letting the code writer run and evaluate test correctness (but adding an investigator to do this instead would avoid confirmation bias, what you call it finding a loophole).

> I haven’t found that to be the case in practice.

Do you have any public test code you could share? Or create even, should be fast.

I'm asking because I hear this constantly from people, and since most people don't have as high standards for their testing code as the rest of the code, it tends to be a half-truth, and when you actually take a look at the tests, they're as messy and incorrect as you (I?) think.

I'd love to be proven wrong though, because writing good tests is hard, which currently I'm doing that part myself and not letting LLMs come up with the tests by itself.

I'm doing all my work at Google, so its not like I can share it so easily. Also, since GeminiCLI doesn't support sub-agents yet...I've had to get creative with how I implement my pipelines. The biggest challenge I've found ATM is controlling conversation context so you can control what the AI is looking at when you do things (e.g. not looking at code when writing tests!). I hope I can release what I'm doing eventually, although it isn't a key piece of AI tech (just a way to orchestrate the pipeline to make sure that AI gets different context for different parts of the pipeline steps, it might be obsolete after we get better support for orchestrating dev work in GeminiCLI or other dev-oriented AI front ends).

The tests can definitely be incorrect, and are often incorrect. You have to tell the AI that consider that the tests might be wrong, not the implementation, and it will generally take a closer look at things. They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.

> They don't have to be "good" tests, just good enough tests to get the AI writing not crap code. Think very small unit tests that you normally wouldn't think about writing yourself.

Yeah, those for me are all not "good tests", you don't want them in your codebase if you're aiming for a long-term project. Every single test has to make sense and be needed to confirm something, and should give clear signals when they fail, otherwise you end locking your entire codebase to things, because knowing what tests are actually needed or not becomes a mess.

Writing the tests and let the AI write the implementation ends you up with code you know what it does, and can confidently say what works vs not. When the IA ends up writing the tests, you often don't actually know what works or not, not even by scanning the test titles you often don't learn anything useful. How is one supposed to be able to guarantee any sort of quality like that?

Testing is fun, but getting all the scaffolding in place to get to the fun part and do any testing suuuuucks. So let the LLM write the annoying parts (mocks. so many mocks.) while you do the fun part
> LLMs will invent a method that sounds correct but doesn't exist in the library

I find that this is usually a pretty strong indication that the method should exist in the library!

I think there was a story here a while ago about LLMs hallucinating a feature in a product so in the end they just implemented that feature.

Honestly, I feel humans are similar. It's the generator <-> executive loop that keeps things right
So you want the program to always halt at some point. How would you write a deterministic test for it?
I imagine you would use something that errs on the side of safety - e.g. insist on total functional programming and use something like Idris' totality checker.
I've been using codex and never had a compile time error by the time it finishes. Maybe add to your agents to run TS compiler, lint and format before he finish and only stop when all passes.
I’m not sure why you were downvoted. It’s a primary concern for any agentic task to set it up with a verification path.
This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

Often, if not usually, that means the method should exist.

Only if it's actually possible and not a fictional plot device aka MacGuffin.