Hacker News
by Jadiker 651 days ago
It looks like the hallucination score is somewhat related to perplexity, in the sense that it relies on specific tokens. This could cause issues, because rephrasing an answer or using slightly different terms could lead to a higher hallucination score. E.g., if the correct answer is "John Smith is the world's best baker", then "Mary Kay is the world's best baker" would get a better score (lower hallucination) than "Leading maker of baked items across all the continents: John Smith" according to your metric.

Are there any plans to make updates to this score or add in different metrics for more accurately detecting hallucinations that don't penalize rephrasing?
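
To make the concern concrete, here is a minimal sketch of a plain clipped n-gram precision computed against a reference answer. This is an assumption about how a token-overlap score behaves in general, not the actual metric from the demo; the function names `ngrams` and `ngram_precision` and the tokenization (lowercase, whitespace split, punctuation stripped) are made up for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return counts of n-grams after lowercasing and stripping edge punctuation."""
    tokens = [t.strip(".,:;!?\"") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "John Smith is the world's best baker"
wrong = "Mary Kay is the world's best baker"
paraphrase = "Leading maker of baked items across all the continents: John Smith"

for name, text in [("wrong", wrong), ("paraphrase", paraphrase)]:
    print(name, ngram_precision(text, reference, 1), ngram_precision(text, reference, 2))
```

Under these assumptions the factually wrong answer gets a unigram precision of 5/7 and the correct paraphrase only 3/11 (4/6 versus 1/10 on bigrams), so a hallucination score derived from this overlap would rank the wrong answer as less hallucinated.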

1 comment

Thanks for the well-thought-out question, Jadiker!

This is a potential limitation of N-gram precision with context matching, which we were using in the RAG demo for simplicity (though even with that metric, I don't think the effect would be so extreme :-) ).

We already offer two other hallucination detection approaches that should mitigate this problem: an LLM-as-a-judge model for evaluation, and semantic similarity matching. We've also considered using metrics such as BERTScore. Do you have other ideas? :-)