by reexpressionist
897 days ago
The alternative approach is to start with a small[er] model, but derive reliable uncertainty estimates, only moving to a larger model if necessary (i.e., if the probability of the predictions is lower than needed for the task). And I agree that the leaderboards don't currently reflect the quantities of interest typically needed in practice.
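To make the idea concrete, here is a rough sketch of that kind of cascade. Everything in it is hypothetical: `cascade_predict`, `small_model`, and `large_model` are made-up names, the toy models are stand-ins, and it assumes each model returns a confidence that is already reasonably calibrated, which is the hard part in practice.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a "model" is any callable returning (label, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade_predict(
    text: str, models: Sequence[Model], threshold: float
) -> Tuple[str, float, int]:
    """Try models from smallest to largest.

    Returns (label, confidence, index of the model that answered).
    Stops at the first model whose confidence meets the threshold;
    otherwise falls through to the last (largest) model's answer.
    """
    if not models:
        raise ValueError("need at least one model")
    for index, model in enumerate(models):
        label, confidence = model(text)
        if confidence >= threshold:
            break
    return label, confidence, index


# Toy stand-ins for a cheap model and an expensive one (hypothetical).
def small_model(text: str) -> Tuple[str, float]:
    if "great" in text:
        return ("positive", 0.95)
    return ("negative", 0.55)  # unsure, so the cascade should escalate


def large_model(text: str) -> Tuple[str, float]:
    return ("negative", 0.9)


if __name__ == "__main__":
    print(cascade_predict("great product", [small_model, large_model], 0.8))
    print(cascade_predict("meh", [small_model, large_model], 0.8))
```

The threshold is whatever the task needs; the larger model only gets called for inputs where the smaller one's confidence falls short of it.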
|
That is very, very hard to do in an objective manner, as the current LLM benchmark gaming demonstrates.
Sure, you can deploy a smaller model to production to get real-world user data and feedback, but a) deploying a suboptimal model can give a bad first impression and b) the quality is still subjective and requires other metrics to be analyzed. Looking at prediction probabilities only really helps if you have a single correct output token, which isn't what LLM benchmarks test for.