|
|
|
|
|
by simonw
352 days ago
|
|
If you define "hallucinations" to mean "any mistakes at all" then yes, a compiler won't catch them for you. I define hallucinations as a a particular class of mistakes where the LLM invents eg a function or method that does not exist. Those are solved by ensuring the code runs. I wrote more about that here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/ Even beyond that more narrow definition of a hallucination, tool use is relevant to general mistakes made by an LLM. The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser, for example: https://simonwillison.net/2025/Jun/23/phoenix-new/ The more tools like this come into play, the less concern I have about the big black box of matrices occasionally hallucinating up some code that is broken in obvious or subtle ways. It's still on us as the end users to confirm that the code written for us actually does the job we set out to solve. I'm fine with that too. |
|
That's not quite my definition. If we're judging these tools by the same criteria we use to judge human programmers, then mistakes and bugs should be acceptable. I'm fine with this to a certain extent, even though these tools are being marketed as having superhuman abilities. But the problem is that LLMs create an entirely unique class of issues that most humans don't. Using nonexistent APIs is just one symptom of it. Like I mentioned in the comment below, they might hallucinate requirements that were never specified, or fixes for bugs that don't exist, all the while producing code that compiles and runs without errors.
But let's assume that we narrow down the definition of hallucination to usage of nonexistent APIs. Your proposed solution is to feed the error back to the LLM. Great, but can you guarantee that the proposed fix will also not contain hallucinations? As I also mentioned, in most occasions when I've done this the LLM simply produces more hallucinated code, and I get stuck in a neverending loop where the only solution is for me to dig into the code and fix the issue myself. So the LLM simply wastes my time in these cases.
> The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser
That's great, but can you trust that it will cover all real world usage scenarios, test edge cases and failure scenarios, and do so accurately? Tests are code as well, and it can have the same issues as application code.
I'm sure that we can continue to make these tools more useful by working around these issues and using better adjacent tooling as mitigation. But the fundamental problem of hallucinations still needs to be solved. Mainly because it affects tasks other than code generation, where it's much more difficult to deal with.