Hacker News
by feoren 41 days ago
The final output of the neural network part of an LLM is a vector with weights for every token, that is then usually softmaxed and picked from. Can we not quantify the uncertainty by looking at the distribution of weights of the top 10 options? Like we expect for a note-taking app that the top choice would be something like 98% certain, and if we see that the model gives a weight of 60% to "Russia" and 30% to "France", that's just not enough certainty to simply output "Russia". That's exactly when it should say "<uncertain>" or something instead.
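A minimal sketch of that idea in Python, assuming we already have the raw per-token scores (logits) for one output position. The function names, the `<uncertain>` marker, and the 0.98 threshold are illustrative, taken from the numbers in the comment rather than from any real model's API:

```python
import math

def softmax(logits):
    """Turn raw per-token scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_or_flag(logits, min_top_prob=0.98, marker="<uncertain>"):
    """Return the top token, or `marker` if its probability is below the threshold."""
    probs = softmax(logits)
    token, p = max(probs.items(), key=lambda kv: kv[1])
    return (token if p >= min_top_prob else marker), probs

# The 60% / 30% case from the comment (log-probabilities used as logits).
split = {"Russia": math.log(0.6), "France": math.log(0.3), "Germany": math.log(0.1)}
choice, probs = pick_or_flag(split)
print(choice)  # the top probability is 0.6, below 0.98, so the marker is returned
```

Whether the top probability is a trustworthy confidence measure is a separate question; this only shows the thresholding step.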
3 comments

I’ve looked at the confidence outputs for the chosen words from several STT providers, and low confidence definitely indicates a risk that the model has misheard.

Not always, though. Say someone says "1 2 3 4 <unintelligible> 6 7 8": it will happily write 5 in the middle and give it good confidence because, based on the context, it is the only likely word. It varies between STT providers, though.

Basically, the reason they are so good on average is that they estimate what is most often said given the context, where the context is not only the audio but also what was transcribed previously.

And if you don’t want it to rely on what is most likely to be said in context, and instead use only the audio around a single word, it is going to be awfully wrong most of the time.
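A toy illustration of why the counting example gets high confidence, assuming the model behaves roughly like "acoustic evidence times context prior". Real transformer-based recognizers do not factor things this explicitly, and all the numbers and names below are made up:

```python
def posterior(acoustic, context_prior):
    """Combine per-word acoustic likelihoods with a context prior and normalize."""
    joint = {w: acoustic[w] * context_prior.get(w, 0.0) for w in acoustic}
    total = sum(joint.values())
    return {w: v / total for w, v in joint.items()}

# Unintelligible audio: the acoustics barely prefer any candidate.
acoustic = {"five": 0.34, "fine": 0.33, "hive": 0.33}

# After "1 2 3 4", the context overwhelmingly expects "five" (invented numbers).
context_prior = {"five": 0.98, "fine": 0.01, "hive": 0.01}

# Audio only: a flat prior leaves the decision to the acoustics alone.
flat_prior = {w: 1 / len(acoustic) for w in acoustic}

with_context = posterior(acoustic, context_prior)
audio_only = posterior(acoustic, flat_prior)
print(round(with_context["five"], 3), round(audio_only["five"], 3))
```

With context the reported confidence for "five" is very high even though the audio carried almost no information; without context it is close to a coin flip between the candidates.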

It seems like the problem in this application is the attention itself. Makes me wonder if a transformer is the correct architecture for transcription.
Unfortunately, that likely just doesn't exist. Everything suggests that these models are confident about their mistakes.
I mean, what I describe absolutely does exist, that's how LLMs work. The question is whether the relative weights are actually a good measure of confidence, and as the other reply to my comment points out, there are examples where it's not -- at least not the kind of "confidence" we really want.
I think it might break the game. Most words sound similar enough to other words. "cat" and "get", "he simply" and "his simply", etc.

Add accents, and half the words would be indistinguishable from each other (note that the word "indistinguishable", ironically, would be quite distinguishable).

People parse things like that with a lot of context, based on their own understanding of the situation, their grasp of the speaker's accent or speech impairments, etc.

Add to that the fact that most native English speakers blur words together. The pause that some languages use to separate words is used in English to separate sentences. Spoken English doesn't natively separate words.

Speech-to-text before LLMs was meh. I think it's the ability to generate filler for uncertain words that makes it feel like magic compared to before.