by ggnore7452 603 days ago
What’s more interesting to me here are the calibration graphs:

• LLMs, at least GPT models, tend to overstate their confidence.
• A frequency-based approach appears to achieve calibration closer to the ideal.
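
For anyone who wants to put a number on "closer to the ideal", here is a minimal sketch of expected calibration error (ECE), a common summary of a calibration graph. It is not taken from the linked page; the function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the confidence attached to each answer.
    correct: booleans, whether each answer was actually right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin is weighted by the share of answers that fall in it.
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0. A model that says 90% and is right half the time scores about 0.4 on those answers.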

This kinda passes my vibe test. That said, I wonder—rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
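
A rough sketch of what I mean by the log-probability ratio, assuming you can get a log-probability for each of the two answers from whatever API you use. How you obtain them is not shown here, and `yes_probability` is a made-up name. For a multi-token answer you would pass the sum of its token log-probabilities.

```python
import math

def yes_probability(logprob_yes, logprob_no):
    """Confidence in "yes" after renormalizing over the two answers.

    Equivalent to exp(lp_yes) / (exp(lp_yes) + exp(lp_no)), which is the
    sigmoid of the log-probability ratio lp_yes - lp_no.
    """
    diff = logprob_yes - logprob_no
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

This only renormalizes over "yes" and "no", so any probability mass the model puts on other outputs is ignored. Whether the result is as well calibrated as the 100-trial frequency is exactly the open question.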

2 comments

If you imagine a future where LLMs get faster and cheaper, even without getting better, we'd be able to automatically repeat each question 100x, and every answer could come with a pretty good confidence measure.
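
Something like this minimal sketch, where `ask` is a hypothetical stand-in for one sampled model call that returns a short answer string. It is not a real library function, and taking the majority answer's share as the confidence is my assumption about how the frequency would be used.

```python
from collections import Counter

def frequency_confidence(ask, question, n_trials=100):
    """Ask the same question n_trials times and report the majority answer
    together with the fraction of trials that produced it."""
    answers = [ask(question) for _ in range(n_trials)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_trials
```

In practice `ask` would need to sample at a nonzero temperature and normalize the answer text (case, whitespace), otherwise every trial is either identical or counted as a distinct answer.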
Yeah, this is by far the most interesting part of this page. That LLMs can know what they know is not a trivial fact.