by feoren
41 days ago
The final output of the neural network part of an LLM is a vector with a weight for every token, which is then usually softmaxed and sampled from. Can we not quantify the uncertainty by looking at the distribution of weights over the top 10 options? For a note-taking app we would expect the top choice to be something like 98% certain, so if we see the model give a weight of 60% to "Russia" and 30% to "France", that's just not enough certainty to simply output "Russia". That's exactly when it should say "<uncertain>" or something instead.
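To make the idea concrete, here is a minimal sketch of that thresholding. The function names, the 0.98 cutoff, and the toy weights are all made up for illustration, and it assumes you can actually get at the raw per-token weights (logits), which not every provider exposes.

```python
import math

def softmax(logits):
    """Turn raw per-token weights (logits) into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_or_flag(logits, min_top_prob=0.98, placeholder="<uncertain>"):
    """Return the top token only if its probability clears the threshold.

    min_top_prob and placeholder are hypothetical knobs, not part of any real API.
    """
    probs = softmax(logits)
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_top_prob else placeholder

# Toy logits chosen so the softmax comes out to roughly 60% / 30% / 10%.
ambiguous = {"Russia": math.log(0.6), "France": math.log(0.3), "Prussia": math.log(0.1)}
# Toy logits where one option dominates almost completely.
confident = {"Russia": 10.0, "France": 0.0}

print(pick_or_flag(ambiguous))
print(pick_or_flag(confident))
```

With these toy numbers the first call falls below the cutoff and returns the placeholder, while the second returns the top token. Where to set the cutoff is a product decision, not something the model tells you.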
|
Not always though. Let's say someone says "1 2 3 4 <unintelligible> 6 7 8": it will happily write 5 in the middle and give it good confidence, because based on the context it is the only likely word. This varies between STT providers though.
Basically, the reason they are so good on average is that they estimate what is most likely to be said given the context, and the context is not only the audio but also what was transcribed previously.
And if you don't want the output to be based on what is most likely to be said in context, but only on the audio around one word, it is going to be awfully wrong most of the time.
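A toy way to see the counting example: treat the final confidence as the audio evidence multiplied by a context prior, then renormalized. This is a simplification for illustration only; real transcription models do not necessarily factor things this neatly, and every number and name below is invented.

```python
def combine(acoustic, context):
    """Multiply audio evidence by the context prior and renormalize."""
    scores = {w: acoustic[w] * context[w] for w in acoustic}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

words = ["five", "nine", "fine", "hive"]

# Unintelligible audio: the sound alone does not favour any candidate.
acoustic = {w: 0.25 for w in words}

# After "1 2 3 4 ... 6 7 8" the context makes "five" overwhelmingly likely.
context = {"five": 0.97, "nine": 0.01, "fine": 0.01, "hive": 0.01}

posterior = combine(acoustic, context)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 2))
```

Here the audio contributes nothing, yet the combined confidence for "five" is still very high, which is why a plain probability threshold would not flag it. Drop the context term and you are left with a four-way tie from the audio alone.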