by _jonas
601 days ago
Here are some benchmarks I ran that compare the precision/recall of various LLM error-detection methods, including logprobs and LLM self-evaluation / verbalized confidence: https://cleanlab.ai/blog/4o-claude/

These approaches can detect errors better than random guessing, but other approaches are significantly more effective in practice.