| > I'm not sure what you mean by "that doesn't work in practice" re: using logits. Like the other comment points out, models today aren't "calibrated" to give that kind of information. More precisely, we aren't training our models to explicitly tell us how confident they are in their predictions. They're simply trained to give the predictions that result in the lowest average error over the training dataset. For example, consider the simple task of recognizing the words "yes" and "no", with outputs written as (P(yes), P(no)). A naive model could return (0%, 100%) all the time (always guess "no") and, if the dataset is balanced, would score 50% accuracy. Another naive model could return (50%, 50%) all the time and get the same 50%. Yet in practice we'd rather have the latter model, because its output reflects how little it actually knows. The former model has the same average error rate but expresses a level of confidence in its answers that isn't justified. As of today, we only train models on the overall error rate, so our training methods don't prefer one kind of output over the other. That's why using the logits to guesstimate confidence isn't a reliable metric. It just happens to work by accident sometimes. A speech recognition model might reach the same WER (word error rate) as a human, but humans are keenly aware of when they didn't hear a word right. That's invaluable information for a food ordering system, which can then ask for clarification rather than blindly following its "best guess" and ending up with the aforementioned order of 40 ketchup packets. And as far as I'm aware there are no loss functions for training confidence measurements into a system, so this is very much an unsolved problem in speech recognition. |
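
To make the yes/no example concrete, here is a rough sketch in plain Python. Everything in it is made up for illustration: the 100-sample balanced dataset, the function names `always_no` and `coin_flip`, and the tie-breaking rule. The Brier score (mean squared error of P(yes) against the true label) is not something the comment above mentions; it is used here only as one possible after-the-fact way to see the difference in expressed confidence that plain accuracy hides.

```python
# Hypothetical balanced dataset: 1 = "yes", 0 = "no" (50 of each).
labels = [1, 0] * 50


def always_no(_sample):
    """Naive model 1: returns (P(yes), P(no)) = (0%, 100%) every time."""
    return (0.0, 1.0)


def coin_flip(_sample):
    """Naive model 2: returns (P(yes), P(no)) = (50%, 50%) every time."""
    return (0.5, 0.5)


def accuracy(model, labels):
    """Fraction of samples where the model's top guess matches the label."""
    correct = 0
    for y in labels:
        p_yes, p_no = model(None)
        # Assumption: ties are broken in favour of "no".
        guess = 1 if p_yes > p_no else 0
        correct += int(guess == y)
    return correct / len(labels)


def brier(model, labels):
    """Mean squared gap between P(yes) and the true label (lower is better)."""
    total = 0.0
    for y in labels:
        p_yes, _ = model(None)
        total += (p_yes - y) ** 2
    return total / len(labels)


for name, model in [("always_no", always_no), ("coin_flip", coin_flip)]:
    print(name, "accuracy:", accuracy(model, labels), "brier:", brier(model, labels))
```

Both models land on 0.5 accuracy, so the error rate alone cannot tell them apart. The Brier score comes out at 0.5 for `always_no` and 0.25 for `coin_flip`, which matches the intuition that the second model's output is the more honest one.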