HN Mirror


by jordanpg 200 days ago

Does anyone know, from a technical standpoint, why citations are such a problem for LLMs?

I realize things are probably (much) more complicated than I think, but programmatically, unlike arbitrary text, citations are generally strings with a well-defined format. There are literally "specs" for citation formats in various academic, legal, and scientific fields.

So, naively, one way to mitigate these hallucinations would be to identify citations with a bunch of regexes and, if one is spotted, use the Google Scholar API (or whatever) to make sure it's real. If not, delete it or flag it, etc.

Why isn't something like this obvious solution already being done? My guess is that it would slow things down too much. But it could be optional, and it could also be done by a separate process after the output is generated.
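To make the idea concrete, here is a rough sketch of the regex-then-lookup pass described above. Everything in it is an assumption for illustration: the two patterns cover only DOI-like strings and "U.S. Reports"-style case citations, and `exists` is a hypothetical callable standing in for whatever lookup service would actually be used. It only checks that a citation string resolves to something, not that the cited source supports the surrounding claim.

```python
import re
from typing import Callable

# Illustrative patterns only; real citation formats are far more varied.
CITATION_PATTERNS = [
    re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),    # DOI-like strings
    re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b"),  # e.g. "347 U.S. 483"
]


def find_citations(text: str) -> list[str]:
    """Return citation-like strings found in text, without duplicates."""
    found = []
    for pattern in CITATION_PATTERNS:
        for match in pattern.finditer(text):
            # Drop sentence punctuation that the DOI pattern may swallow.
            found.append(match.group(0).rstrip(".,;)"))
    return list(dict.fromkeys(found))  # de-duplicate, keep first-seen order


def flag_unverified(text: str, exists: Callable[[str], bool]) -> str:
    """Append a marker to every citation that the lookup cannot confirm.

    `exists` is a hypothetical stand-in for a real lookup service.
    """
    flagged = text
    for citation in find_citations(text):
        if not exists(citation):
            flagged = flagged.replace(citation, f"{citation} [UNVERIFIED]")
    return flagged


# Stub lookup backed by a fixed set, in place of a real API call.
known = {"347 U.S. 483", "10.1000/xyz123"}
sample = "See 347 U.S. 483 and 999 U.S. 999, also doi 10.1000/xyz123."
print(flag_unverified(sample, known.__contains__))
```

Because the lookup is passed in as a function, the same pass could run as an optional post-processing step in a separate process, as suggested above, without touching generation itself.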

1 comments

Muller20 200 days ago

In general, a citation is something that needs to be precise, while LLMs are very good at generating generic, high-probability text that is not grounded in reality. Sure, you could implement a custom fix for the very specific problem of citations, but you cannot solve all kinds of hallucinations that way. After all, if you could develop a manual solution, you wouldn't use an LLM.

There are some mitigations in use, such as RAG or tool usage (e.g. a browser), but they don't completely fix the underlying issue.


jordanpg 200 days ago

My point is that hallucinated citations are constantly making headlines, yet, at least at first glance, this seems like an eminently solvable problem.


ml-anon 200 days ago

So solve it?
