by jordanpg
200 days ago
Does anyone know, from a technical standpoint, why citations are such a problem for LLMs? Things are probably (much) more complicated than I realize, but programmatically, unlike arbitrary text, citations are generally strings with a well-defined format. There are literally "specs" for citation formats in various academic, legal, and scientific fields. So, naively, one way to mitigate these hallucinations would be to identify citations with a bunch of regexes and, whenever one is spotted, use the Google Scholar API (or whatever) to make sure it's real. If it isn't, delete it or flag it, etc. Why isn't something like this obvious solution being done? My guess is that it would slow things down too much. But it could be optional, and it could also be done by a separate process after the output is generated.
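To make the idea concrete, here is a rough sketch of that post-processing pass in Python. Everything in it is an assumption for illustration: the two regexes cover only DOIs and "U.S. Reports" style case citations, and `exists` is a hypothetical callback standing in for whatever lookup service you'd actually query. It only checks that a matching record exists, not that the source says what the model claims it says.

```python
import re
from typing import Callable

# Illustrative patterns only (assumed for this sketch); real citation
# formats are far more varied than these two.
CITATION_PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "us_reports": re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b"),
}

def flag_unverified(text: str, exists: Callable[[str, str], bool]) -> str:
    """Append [UNVERIFIED] to every citation that `exists` rejects.

    `exists(kind, citation)` is a hypothetical lookup; a real version
    might query a bibliographic or case-law database.
    """
    for kind, pattern in CITATION_PATTERNS.items():
        def check(match: re.Match, kind: str = kind) -> str:
            raw = match.group(0)
            # Trailing sentence punctuation is not part of the citation.
            cite = raw.rstrip(".,;)")
            tail = raw[len(cite):]
            if exists(kind, cite):
                return raw
            return f"{cite} [UNVERIFIED]{tail}"
        text = pattern.sub(check, text)
    return text

# Usage with a stand-in lookup backed by a fixed set of known citations.
known = {"410 U.S. 113", "10.1000/xyz123"}
draft = "See 410 U.S. 113 and doi 10.1000/xyz123. Also 999 U.S. 999."
checked = flag_unverified(draft, lambda kind, cite: cite in known)
print(checked)
```

Since this runs on finished output, it could be an optional step in a separate process, as suggested above. The hard parts are the ones the sketch skips: covering the real range of citation formats and having a lookup source that is actually complete.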
Some mitigations are in use, such as RAG or tool use (e.g. a browser), but they don't completely fix the underlying issue.