
by ggnore7452 603 days ago

What’s more interesting to me here are the calibration graphs:

• LLMs, at least GPT models, tend to overstate their confidence.
• A frequency-based approach appears to achieve calibration closer to the ideal.
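To make "calibration" concrete, here is a rough sketch of how one might check it: bucket answers by stated confidence and compare each bucket's average confidence with how often those answers were actually right. The function name `calibration_report`, the equal-width bins, and the summary number (a simple expected calibration error) are my own assumptions for illustration, not necessarily how the graphs on the page were produced.

```python
def calibration_report(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare with observed accuracy.

    confidences: floats in [0, 1], the confidence attached to each answer
    correct:     booleans, whether each answer was actually right
    Returns (rows, ece), where rows holds (mean_confidence, accuracy, count)
    for every non-empty bin and ece is the count-weighted average gap.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = sum(len(b) for b in bins)
    rows, ece = [], 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        rows.append((mean_conf, accuracy, len(b)))
        ece += (len(b) / total) * abs(accuracy - mean_conf)
    return rows, ece
```

A perfectly calibrated model would have `accuracy` close to `mean_conf` in every row; an overconfident one shows accuracy below confidence in the high bins.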

This kinda passes my vibe test. That said, I wonder—rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
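For what it's worth, a minimal sketch of the log-probability idea, under the simplifying assumption that the answer is a single token and that the API exposes log-probabilities for both the "yes" and "no" tokens at that position. The function name `yes_probability` and its inputs are hypothetical; it only renormalizes over the two options, which is the same as applying a sigmoid to the log-probability difference.

```python
import math


def yes_probability(logprob_yes, logprob_no):
    """Probability of "yes" after renormalizing over just {"yes", "no"}.

    Equivalent to sigmoid(logprob_yes - logprob_no). Any probability mass
    the model puts on other tokens is ignored (an assumption of this sketch).
    """
    diff = logprob_yes - logprob_no
    # Numerically stable sigmoid for large positive or negative differences.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Whether this number is as well calibrated as the frequency from repeated trials is exactly the open question; the sketch only shows that it can be read off from one forward pass instead of 100 samples.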

2 comments

ALittleLight 602 days ago

If you imagine a future where LLMs get faster and cheaper, even without getting better, we'd be able to automatically repeat each question 100x, and every answer could come with a pretty good confidence measure.
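As a sketch of what that could look like: ask the same question many times, take the most common answer, and report the fraction of trials that agreed with it. Here `ask` is a hypothetical callable standing in for whatever sends a prompt to a model (sampled with nonzero temperature) and returns its text answer; the normalization by `strip().lower()` is also an assumption and would need to be smarter for free-form answers.

```python
from collections import Counter


def frequency_confidence(ask, question, n_trials=100):
    """Repeat a question and use answer frequency as a confidence estimate.

    ask:      hypothetical callable, question -> answer string
    Returns (most_common_answer, fraction_of_trials_giving_that_answer).
    """
    # Light normalization so "Yes" and "yes " count as the same answer.
    answers = [ask(question).strip().lower() for _ in range(n_trials)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_trials
```

With 100 trials the estimate moves in steps of 0.01, so the cost is 100 model calls per question for roughly two digits of resolution.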

GaggiX 603 days ago

Yeah, this is by far the most interesting part of this page. The fact that LLMs can know what they know is not trivial.