> I'm not sure what you mean by "that doesn't work in practice" re: using logits.
As the other comment points out, models today aren't "calibrated" to give that kind of information. More precisely, we aren't training our models to explicitly tell us how confident they are in their predictions. They're simply trained to give the predictions that result in the lowest average error over the training dataset.
For example, consider the simple task of recognizing the words "yes" or "no". A naive model could return (0%, 100%) all the time (always guess "no") and, if the dataset is balanced, would get an accuracy of 50%. Another naive model could return (50%, 50%) all the time and get the same accuracy, 50%. Yet in practice we'd rather have the latter model because it better expresses the model's actual level of confidence. The former model, even though it gets the same average error rate, expresses a level of confidence in its answers that isn't there.
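To make the yes/no example concrete, here's a rough sketch. The balanced toy dataset, the two model functions, and the tie-breaking rule are all made up for illustration; the only point is that accuracy alone can't tell the two models apart.

```python
# Labels: 0 = "yes", 1 = "no". A balanced toy dataset (assumed for illustration).
labels = [0, 1] * 50

def always_no(_utterance):
    # Returns (P(yes), P(no)): fully confident in "no" every time.
    return (0.0, 1.0)

def coin_flip(_utterance):
    # Returns (P(yes), P(no)): admits it has no idea.
    return (0.5, 0.5)

def accuracy(model, labels):
    correct = 0
    for y in labels:
        p_yes, p_no = model(None)
        # Ties go to "no"; this tie-break is arbitrary.
        guess = 1 if p_no >= p_yes else 0
        correct += (guess == y)
    return correct / len(labels)

acc_always_no = accuracy(always_no, labels)
acc_coin_flip = accuracy(coin_flip, labels)
print(acc_always_no, acc_coin_flip)
```

Both come out at 50% on this dataset, even though one model claims certainty and the other claims none.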
As of today, we only train models on the overall error rate, so our training methods don't prefer one kind of output over the other. That's why reading confidence off the logits isn't actually a reliable metric. It just happens to work sometimes, by accident.
A speech recognition model might get to the same WER as a human, but humans are keenly aware of when they didn't hear a word right. That's invaluable information for a food ordering system, which could then respond by asking for clarification rather than blindly following its "best guess" and ending up with the aforementioned order of 40 ketchup packets.
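Roughly, the behavior you'd want from the ordering system looks like the sketch below. Everything in it is hypothetical (the `handle_item` function, the 0.8 threshold, the example confidences), and it only helps if the confidence number actually means something, which is exactly the problem.

```python
def handle_item(item, quantity, confidence, threshold=0.8):
    """Accept the recognized order item, or ask the customer to confirm it.

    `confidence` is assumed to be a trustworthy probability that the
    recognition is correct. The threshold value is arbitrary.
    """
    if confidence >= threshold:
        return f"Added {quantity} x {item}."
    return f"Sorry, did you say {quantity} x {item}?"

# Hypothetical recognizer outputs.
sure = handle_item("ketchup packet", 4, confidence=0.95)
unsure = handle_item("ketchup packet", 40, confidence=0.35)
print(sure)
print(unsure)
```

If the model reports 0.95 on the "40" as well, the threshold never fires and you get the ketchup anyway.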
And as far as I'm aware there are no loss functions for training confidence measurements into a system, so this is very much an unsolved problem in speech recognition systems.
> I'm not sure what you mean by "that doesn't work in practice" re: using logits.
I suspect they mean that ML models are usually poorly calibrated and that the softmax-over-logits probabilities generally don't reflect actual error rates, so they're tough to use meaningfully for deciding when to ask people to repeat themselves.
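If you want to check this on your own model, one common measure is expected calibration error: bucket predictions by softmax confidence and compare each bucket's average confidence with how often it was actually right. A minimal sketch, with made-up logits and correctness flags and an arbitrary choice of 10 bins:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - acc)
    return ece

# Made-up example: the model is ~99% "sure" every time but right 3 times out of 4.
logits = [[5.0, 0.0]] * 4
confidences = [max(softmax(z)) for z in logits]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

A well-calibrated model would score near 0 here; a large gap means the softmax numbers can't be read as error rates.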
Personally, if I have to deal with an automated order system, I'd rather have some kind of search tree that lets me traverse it using three well-separated noises (left, right, back) and a "dumb" back-end, instead of having to pretend that an ML system and I are having the meeting of the minds that a voice-based discussion implies.
I understand that such a system would be hard or impossible to train laypeople to use, but it would be nice to have a "cut the crap" option to let people interface more effectively with the order system and not take part in the charade of a "discussion".