by ggnore7452
603 days ago
What’s more interesting to me here are the calibration graphs:
• LLMs, at least GPT models, tend to overstate their confidence.
• A frequency-based approach appears to achieve calibration closer to the ideal.

This kinda passes my vibe test. That said, I wonder: rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
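To make the log-probability idea concrete, here's a rough sketch of what I have in mind, not a tested replacement for the sampling approach. It assumes you can get per-token log-probabilities for both a "yes" completion and a "no" completion from whatever API you're using; the function name and the numbers are made up for illustration. For multi-token answers it sums the token log-probs, then normalizes over just the two options, which works out to a sigmoid of the log-probability ratio.

```python
import math


def yes_probability(yes_token_logprobs, no_token_logprobs):
    """Estimate P(yes) from the log-probabilities of two candidate answers.

    Each argument is a list of per-token log-probabilities for one answer
    (hypothetical inputs; how you obtain them depends on your model API).
    """
    # The log-probability of a multi-token answer is the sum over its tokens.
    lp_yes = sum(yes_token_logprobs)
    lp_no = sum(no_token_logprobs)

    # Normalizing over the two options gives
    #   p_yes = exp(lp_yes) / (exp(lp_yes) + exp(lp_no)) = sigmoid(lp_yes - lp_no)
    diff = lp_yes - lp_no
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # Branch on the sign so exp() never sees a large positive argument.
    e = math.exp(diff)
    return e / (1.0 + e)


# Made-up token log-probs, purely for illustration.
yes_lp = [math.log(0.6), math.log(0.9)]
no_lp = [math.log(0.3), math.log(0.8)]
p_yes = yes_probability(yes_lp, no_lp)
print(f"estimated P(yes) = {p_yes:.3f}")
```

One catch: this ignores any probability mass the model puts on answers other than the two candidates, so whether it actually matches the frequency-based calibration would still need to be checked empirically.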