|
With all due respect, and while wishing you the best of luck, it's always a bit worrisome when generative AI is used in the real world with real consequences... In my experience, what LLMs, even some of the most advanced ones (o1, Gemini 1.5), are really good at is rationalization after the fact: explaining why they were right, even when presented with direct evidence to the contrary. I just ran an experiment trying to get various models to put footnote references in the OCR of a text, based on the content of the footnotes. I tested 120+ different models via OpenRouter; they all failed, but the "best" ones failed in a very bizarre and, I think, dangerous way: they made up some text to better fit the footnote references! And then they lied about it, saying in a "summary" paragraph that no text had been changed, and/or that they had indeed been able to place all references. So I guess my question is: how do you detect and flag hallucinations? |
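
The "made up some text" part of this failure, at least, can be checked mechanically instead of trusting the model's own summary. Below is a minimal sketch, assuming the footnote markers look like `[^1]` (a hypothetical format; adjust the pattern to whatever the models were asked to emit). It strips the markers from the model's output and diffs the remaining words against the original OCR. The function names are illustrative, and this only detects changed wording; it says nothing about whether a reference was placed in the right spot.

```python
import difflib
import re

# Assumed marker format "[^1]", "[^2]", ... (hypothetical; change to match your prompt).
# Leading whitespace is consumed so "word [^1]." and "word[^1]." both reduce to "word."
MARKER = re.compile(r"\s*\[\^\d+\]")


def normalize(text: str) -> str:
    """Remove footnote markers and collapse whitespace so only the words remain."""
    return " ".join(MARKER.sub("", text).split())


def altered_spans(original: str, annotated: str) -> list[tuple[str, str]]:
    """Return (original words, model words) pairs wherever the wording differs."""
    a = normalize(original).split()
    b = normalize(annotated).split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty list from `altered_spans(ocr_text, model_output)` means the model only added markers; anything else is a span it rewrote, inserted, or dropped, regardless of what its "summary" paragraph claims.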