Hacker News
by motorest 384 days ago
Taken from the blog:

> Why are we talking about “graduate and PhD-level intelligence” in these systems if they can’t find and verify relevant links — even directly after a search?

This is a pet peeve of mine, and recently OpenAI's models seem to have become very militant in how they stand by and push their obviously hallucinated sources. I'm talking about hallucinating answers; when pressed to cite sources, they also hallucinate URLs that never existed; when repeatedly prompted to verify how they are hallucinating, they stick to their clearly wrong output, and ultimately fall back to claiming they were right but the URL somehow changed, even though it never existed.

In order to start talking about PhD-level intelligence, at the very least these LLMs must support PhD-level context-seeking and information verification. It is not enough to output a wall of text that reads quite fluently. You must stick to verifiable facts.

5 comments

The approach of generating something and then looking for hallucinations is just stupid. To validate the output I have to be an expert. How do I become an expert if I rely on LLMs? It's a dead end.
> The approach of generating something and then looking for hallucinations is just stupid. To validate the output I have to be an expert.

No. You only need to check for sources, and then verify these sources exist and they support the claims.

It's the very definition of "fact".

In some cases, all you need to do is check if a URL that was cited does exist.

"and support the claims" is doing some *extremely* heavy lifting there.

I can't write a software program, give the source to the greengrocer and expect him to be able to say anything about its quality. Just like I can't really say much about vegetables.

If the output is interpreting sources rather than just regurgitating quotes from them, you need to exert judgment to verify they support its claims. When the LLM output is about some highly technical subject, it can require expert knowledge just to judge whether the source supports the claims.
Including literal 404s... As an outsider it has always struck me as absurd that they don't just do the equivalent of wget over all provided sources.
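For what it's worth, here is a minimal sketch of the "wget over all provided sources" idea, in Python with only the standard library. The names `extract_urls`, `http_status`, and `find_dead_links` are made up for illustration, and the URL regex is deliberately crude. It only tells you whether a cited URL resolves, not whether the page supports the claim.

```python
import http.client
import re
import urllib.error
import urllib.request

# Crude pattern (an assumption, not a full URL grammar): stop at whitespace,
# angle brackets, parentheses, and square brackets.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(text):
    """Return the URLs found in text, in order of first appearance, deduplicated."""
    seen = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response was obtained."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error, e.g. 404
    except (OSError, ValueError, http.client.HTTPException):
        return None  # DNS failure, refused connection, timeout, malformed URL


def find_dead_links(text, status_of=http_status):
    """Map each cited URL whose status is missing or >= 400 to that status."""
    dead = {}
    for url in extract_urls(text):
        status = status_of(url)
        if status is None or status >= 400:
            dead[url] = status
    return dead
```

`status_of` is injectable so the logic can be exercised without network access. Some servers reject HEAD requests, so a real checker would probably retry with GET before declaring a link dead.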
Or why the LLM doesn’t do a lookup into a subset of the training data as a database and reject the output if it seems to be wrong. A billion of the most common URLs and the entirety of Wikipedia, arXiv and Stack Overflow would go a long way.
If that could be done, then we would be using that and skipping the llms entirely
Can’t see why that couldn’t be done. You save an HTTP request for a ton of the URLs.
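A rough sketch of what that local lookup could look like, assuming the known URLs fit in an in-memory set (a billion of them would need something more compact). `normalize`, `KnownUrlIndex`, and `classify` are hypothetical names. A hit only means the URL string has been seen before; anything unknown would still need a live request.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query, since it can select a different page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


class KnownUrlIndex:
    """Set of normalized URLs that are already known to exist."""

    def __init__(self, urls):
        self._known = {normalize(u) for u in urls}

    def __contains__(self, url):
        return normalize(url) in self._known


def classify(urls, index):
    """Split urls into (known, unknown); only the unknown ones need an HTTP check."""
    known, unknown = [], []
    for url in urls:
        (known if url in index else unknown).append(url)
    return known, unknown
```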
Because if the llm could tell right from wrong, it wouldn't have to do this in the first place. It's like the bible claiming it's true because the bible says it's true. Circular logic.
Seems like the LLM is giving correct output if it’s generating a plausible string of tokens in response to your string of tokens.
> Seems like the LLM is giving correct output if it’s generating a plausible string of tokens in response to your string of tokens.

No. If you prompt it to get a response and then you ask it to cite sources, if it outputs broken links that never existed then it clearly failed to deliver correct output.

"correct" for an llm means "fits the statistical distributions in the training data"

"correct" for you is "truth that corresponds to the real world"

They are two very different things. The llm's output is, very much, correct. Because it was never meant to mean anything other than similarity of probability distributions.

It's not what you wanted, but that doesn't make it incorrect. You're just under a wrong assumption about what you were asking for. You were asking for something that looks like it could be true. Even if you ask it to not hallucinate, you're just asking it to make it look like it is not hallucinating. Meanwhile you thought you were asking for the actual, real, answer to your question.

Right, the dialogue between the user and the LLM closely resembles documents used in training the LLM. People argue with, lie to, and misunderstand others on the internet. Here's a totally plausible hypothetical forum discussion:

Person A: I believe X.

Person B: Do you have a source for that?

A: Yes, it was shown by blah blah in the paper yada yada.

B: I don't think that study exists. Share a link?

A: [posts a URL]

B: That's not a real paper. The URL doesn't even work!

A: Works on my machine.

---

I've seen that kind of chat so many times online. Know what I haven't seen very often? When person A says "You're right, I made up that article. Let me look again for a real one, and I might change my opinion depending on what it says."

Why isn't the LLM under the wrong assumption? So I don't get from my tool what I need and it's still me at fault? I am not yet ready to bow to the AI overlords, sorry.
Oh okay, guess all LLMs are just fine then and we don't need to do any further development on them.
But are the links plausible text given the training data?

If the purpose is to accurately cite sources, how is it even possible to hallucinate them? Seems like folks are expecting way too much from these tools. They are not intelligent. Useful, perhaps.

Seems that's just expecting things that LLMs were not designed for.

It's a token producer based on trained weights, it doesn't use any sources.

Even if it were "fixed" so that it only generates URLs that exist, it's still incorrect because it did not use any sources so those URLs are not sources.

Then let's face it: LLMs were not designed to give proper answers. Now that we settled this and the emperor is obviously naked, what?
I have search enabled 100% of the time with ChatGPT and would never go back to raw-dogging LLM citations. O3 especially has passed the threshold of “not always annoying”. Had an argument with Gemini yesterday where it was insisting on some hallucinated implementation of a function even while giving me a GitHub link to the correct source.
This is trivial to overcome by using a REST client to verify the link through MCP, and by caching results it wouldn't even add much latency.
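The caching half of that is easy to sketch. Below is a small time-to-live cache around any link-checking function. `CachedLinkChecker` and its parameters are hypothetical names, and exposing it as an MCP tool is left out because that wiring depends on the MCP SDK in use.

```python
import time


class CachedLinkChecker:
    """Wrap a link-checking callable and remember its results for ttl_seconds."""

    def __init__(self, check, ttl_seconds=3600.0, clock=time.monotonic):
        self._check = check      # callable: url -> result (e.g. bool or status code)
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._cache = {}         # url -> (expires_at, result)

    def __call__(self, url):
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: skip the network round trip
        result = self._check(url)
        self._cache[url] = (now + self._ttl, result)
        return result
```

Usage would be something like `checker = CachedLinkChecker(my_check)` followed by `checker(url)`, where `my_check` is whatever function actually performs the HTTP request. Negative results are cached too, which is a design choice: a link that was dead a minute ago is reported dead again without a second request.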