
by _jonas 601 days ago

Here are some benchmarks I ran that compare the precision/recall of various LLM error-detection methods, including logprobs and LLM self-evaluation / verbalized confidence:

https://cleanlab.ai/blog/4o-claude/

Logprobs and self-evaluation can detect errors better than random guessing, but other approaches are significantly more effective in practice.
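For anyone unfamiliar with the logprob baseline, here is a rough sketch of how such a score could be built and evaluated. It is not the code behind the linked benchmarks: the function names, the geometric-mean scoring rule, the 0.5 threshold, and the sample numbers are all assumptions made up for illustration.

```python
import math


def sequence_confidence(token_logprobs):
    """Collapse per-token logprobs into one score in (0, 1].

    Uses the geometric mean of token probabilities, exp(mean(logprobs)).
    This is one common choice, assumed here for illustration.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def precision_recall(confidences, is_error, threshold):
    """Flag responses with confidence below `threshold` as likely errors,
    then score those flags against ground-truth error labels."""
    flagged = [c < threshold for c in confidences]
    tp = sum(f and e for f, e in zip(flagged, is_error))
    fp = sum(f and not e for f, e in zip(flagged, is_error))
    fn = sum((not f) and e for f, e in zip(flagged, is_error))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-token logprobs for four responses (made-up numbers).
responses = [
    [-0.05, -0.10],  # confident, correct
    [-1.20, -0.90],  # unsure, wrong
    [-0.02, -0.03],  # confident, correct
    [-0.70, -1.50],  # unsure, but actually correct
]
is_error = [False, True, False, False]

confidences = [sequence_confidence(lp) for lp in responses]
precision, recall = precision_recall(confidences, is_error, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the low-confidence flag catches the one real error but also flags a correct answer, which is the kind of precision/recall tradeoff the benchmarks measure across thresholds.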