by minimaxir
897 days ago
> derive reliable uncertainty estimates

That is very, very hard to do objectively, as the current gaming of LLM benchmarks demonstrates. Sure, you can deploy a smaller model to production to get real-world user data and feedback, but a) deploying a suboptimal model can give a bad first impression and b) quality is still subjective, so you need other metrics to assess it. Looking at prediction probabilities only really helps if you have a single correct output token, which isn't what LLM benchmarks test for.

Hopefully in 2024 we can get at least one of the benchmarks to move to assessing non-parametric/distribution-free uncertainty for selective classification, reflecting more recent CS/Stats advances that should be used in practice. Working on it.
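
For anyone unfamiliar with the terms, one distribution-free approach is split conformal prediction, which can be turned into a selective classifier by abstaining whenever the prediction set is not a single label. Below is a minimal sketch, assuming a held-out calibration set with per-class probabilities and exactly one correct label per example. The function names are hypothetical, and this is not necessarily the method any benchmark would adopt.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold on the score 1 - p(true label).

    cal_probs: list of per-class probability lists (calibration set).
    cal_labels: list of true class indices, one per calibration example.
    alpha: target miscoverage rate, e.g. 0.1 for roughly 90% coverage.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return scores[k - 1]

def prediction_set(probs, q_hat):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= q_hat]

def selective_predict(probs, q_hat):
    """Return a label only when the prediction set is a singleton, else None."""
    labels = prediction_set(probs, q_hat)
    return labels[0] if len(labels) == 1 else None

# Toy calibration data (illustrative numbers, not real model outputs).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
cal_labels = [0, 1, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
confident = selective_predict([0.9, 0.05, 0.05], q_hat)  # singleton set
ambiguous = selective_predict([0.5, 0.4, 0.1], q_hat)    # two labels: abstain
```

If calibration and test examples are exchangeable, the prediction set contains the true label with probability at least 1 - alpha, whatever the underlying model or data distribution. Note that this still assumes a single correct label, so it sidesteps rather than solves the open-ended generation problem raised above.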