Hacker News
by r_lee 37 days ago
I don't think it's a training issue; it's simply that there's no inherent "I don't know" in the transformer architecture. Unless the input is something completely unknown, the nearest neighbor will be chosen, and that will be whatever sounds similar or is relevant, even if it might cause a problem.
6 comments

The final output of the neural network part of an LLM is a vector with weights for every token, that is then usually softmaxed and picked from. Can we not quantify the uncertainty by looking at the distribution of weights of the top 10 options? Like we expect for a note-taking app that the top choice would be something like 98% certain, and if we see that the model gives a weight of 60% to "Russia" and 30% to "France", that's just not enough certainty to simply output "Russia". That's exactly when it should say "<uncertain>" or something instead.
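A minimal sketch of that idea, assuming we can read the raw per-token scores at one decoding step. The function names, the 0.9 threshold, and the toy logits are all made up for illustration; whether the resulting probabilities are a trustworthy confidence measure is a separate question.

```python
import math

def softmax(logits):
    """Turn raw per-token scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_or_flag(logits, min_top_prob=0.9, top_k=10, placeholder="<uncertain>"):
    """Return the top token, or a placeholder if its probability is too low.

    min_top_prob and placeholder are hypothetical knobs, not part of any
    real decoding API.
    """
    probs = softmax(logits)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    best_token, best_prob = ranked[0]
    if best_prob < min_top_prob:
        return placeholder, ranked
    return best_token, ranked

# Toy logits chosen so the softmax gives roughly 60% / 30% / 10%.
split = {"Russia": math.log(0.6), "France": math.log(0.3), "Prussia": math.log(0.1)}
# Toy logits where one option dominates at roughly 98%.
clear = {"Russia": math.log(0.98), "France": math.log(0.01), "Prussia": math.log(0.01)}

split_choice, _ = pick_or_flag(split)
clear_choice, _ = pick_or_flag(clear)
```

With these toy numbers the 60/30 split falls below the threshold and is flagged, while the 98% case is emitted as normal.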
I’ve looked at confidence outputs for the chosen words from several STT providers, and low confidence definitely does indicate a risk that the system has misheard.

Not always though. Let’s say that someone says “1 2 3 4 <unintelligible> 6 7 8”: it will happily write 5 in the middle and give it good confidence because, based on the context, it is the only likely word. Varies between STT providers though.

Basically, the reason they are so good on average is that they estimate what is most often said given the context, the context being not only the audio but also what was transcribed previously.

And if you don’t want it to be based on what is most likely to be said in context, and only on the audio around one word, it is going to be awfully wrong most of the time.
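A toy illustration of the counting example, assuming a simple "acoustic evidence times context prior" model. This is not how any particular STT provider works, and the words and numbers are invented. It only shows how a strong context prior can produce high confidence even when the audio itself is uninformative.

```python
def posterior(acoustic_likelihood, context_prior):
    """Combine per-word acoustic evidence with a context prior (Bayes' rule).

    Both arguments are hypothetical dicts mapping candidate words to
    non-negative scores; at least one word must score above zero in both.
    """
    unnormalized = {
        word: acoustic_likelihood[word] * context_prior.get(word, 0.0)
        for word in acoustic_likelihood
    }
    total = sum(unnormalized.values())
    return {word: score / total for word, score in unnormalized.items()}

# The audio is unintelligible: all candidates sound about equally plausible.
acoustic = {"five": 0.34, "nine": 0.33, "fine": 0.33}
# After "1 2 3 4", the context overwhelmingly expects "five".
prior = {"five": 0.98, "nine": 0.01, "fine": 0.01}

post = posterior(acoustic, prior)
best_word = max(post, key=post.get)
```

Here the reported confidence for "five" comes almost entirely from the prior, which is the failure mode described above.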

It seems like the problem in this application is attention itself. Makes me wonder if a transformer is the correct architecture for transcription.
Unfortunately, that likely just doesn't exist. Everything suggests that these models are confident about their mistakes.
I mean, what I describe absolutely does exist, that's how LLMs work. The question is whether the relative weights are actually a good measure of confidence, and as the other reply to my comment points out, there are examples where it's not -- at least not the kind of "confidence" we really want.
I think it might break the game. Most words sound similar enough to other words. "cat" and "get", "he simply" and "his simply", etc.

Add accents, and half the words would be indistinguishable from each other (note that the word "indistinguishable", ironically, would be quite distinguishable).

People parse things like that with a great deal of context, based on their own understanding of a situation, their grasp of the speaker's accent or speech impairments, etc.

Add to that that most native English speakers blur words together. The pause that some languages use to separate words is used in English to separate sentences. Spoken English doesn't natively separate words.

Speech-to-text before LLMs was meh. I think it's the ability to generate filler for uncertain words that makes it feel like magic compared to before.

That's not inherent in the transformer architecture. We do try to ingrain a sense of uncertainty, but it’s difficult not only technically but also philosophically/culturally. How confident do you want the model to be in its answer to “why did Rome fall”?

We have lots of tools in our toolbelts for better uncertainty calibration, but it trades off against other capabilities and can actually be rather frustrating to interact with in agentic contexts, since the model will constantly need input from you or otherwise be indecisive and overly cautious. It’s not technically a limitation of the transformer architecture, but it is more challenging to deal with than in other architectures/statistical paradigms.

For example, you can maintain a belief state, generate conditional on it, and train to ensure the belief state is stable and performant. But evals reward guessing at this point, and it’s very, very hard to evaluate calibration in these open-ended contexts. But we’re slowly getting there, just not nearly as fast as with other capabilities.
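To make "evals reward guessing" concrete, here is a small sketch under an assumed grading scheme: 1 point for a correct answer, 0 for abstaining, and a configurable penalty for a wrong answer. The function names and the scoring rule are hypothetical and not taken from any specific benchmark.

```python
def expected_scores(p_correct, wrong_penalty=0.0, abstain_score=0.0):
    """Expected score of guessing vs. abstaining under the assumed grading."""
    guess = p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty
    return {"guess": guess, "abstain": abstain_score}

def best_action(p_correct, wrong_penalty=0.0):
    """Pick whichever action has the higher expected score."""
    scores = expected_scores(p_correct, wrong_penalty)
    return "guess" if scores["guess"] > scores["abstain"] else "abstain"

# Plain accuracy grading (no penalty): even a 10% shot beats saying "I don't know".
no_penalty = best_action(0.10, wrong_penalty=0.0)
# With a penalty of 1 per wrong answer, the same 10% shot is not worth taking...
with_penalty_low = best_action(0.10, wrong_penalty=1.0)
# ...but a 60% shot still is (break-even is penalty / (1 + penalty) = 0.5).
with_penalty_high = best_action(0.60, wrong_penalty=1.0)
```

Under accuracy-only grading, abstaining never wins for any nonzero chance of being right, so a model optimized against such a score has no incentive to abstain.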

>How confident do you want the model to be in its answer to “why did Rome fall”?

The confidence level can be anything, as long as it's reported accurately often enough. "This is my conjecture, but", "I'm not completely sure, but", and "most historians agree that" are all perfectly valid ways to start a sentence, which LLMs never use. They state mathematical truth, general consensus, hotly debated stances, and total fabrication with the exact same assertiveness.

> Like you can maintain a belief state and generate conditional on this and train to ensure belief state is stable and performant

> ways to start a sentence, which LLMs never use

A huge part of the problem is we've invented a document-generator setup which exploits human cognitive illusions, and even the smartest person can't constantly override the instinctive brain-bits that "see" fictional entities and infer the intent of a mind. That makes it weirdly hard to discuss the setup's shortfalls or how to improve it.

To wit: The machine does not possess any kind of confidence about how Rome fell. Or even whether Rome fell. It has "confidence" about which word/token will come next in a "typical" document, given that the document-so-far has text like "How did Rome fall?" It may be straightforward to burn money training the system so that its "typical" story never has a computer-character with confident words about Roman history, but that's just papering over the underlying problem.

TLDR: We can't fix the thinking-habits or beliefs inside the mind of an entity that doesn't actually exist. Changing the story-generator to contain a tee-totaling Dracula dispensing life-advice doesn't mean we "cured the disease of vampirism."

IIRC people actually measured it, and one of the things RLHF does is to turn the fairly well-calibrated probability judgments of the raw predictive model into an essentially binary and much more inaccurate “definitely” / “no idea, coin toss”, the former member of the pair being of course much more frequent. The architecture is perfectly capable of uncertainty, it’s the humans that hate it and sand the capability off until the result fits their preconceptions.

(Which is intensely depressing to a human that doesn’t.)
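For anyone wondering what "well-calibrated" means operationally, here is a minimal sketch of one common summary, expected calibration error: bucket predictions by stated confidence and compare each bucket's average confidence with its actual accuracy. The data below is invented purely to show the arithmetic and is not a measurement of any model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up example: says "80%" and is right 4 times out of 5.
calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Made-up example: says "99%" but is right only 3 times out of 5.
overconfident = expected_calibration_error([0.99] * 5, [True, True, True, False, False])
```

The first toy case scores essentially 0 (confidence matches accuracy), while the second scores about 0.39, which is the kind of gap the "definitely" failure mode would show up as.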

I feel like if you trained better for "I don't know", it would drag down competence everywhere else somehow. Like, the strength of a model is exactly its ability to grasp at straws and very often find the right one.

If you ask a good model something that makes no sense, it will tell you it makes no sense and it can't answer the question; so I know it's possible.

Surely they could be built to put placeholders for low-confidence predictions and ignore those bits when predicting the rest?

The reason AI companies won’t do this, of course, is that it would completely ruin the illusion of confidence these machines project.

The thing is, if LLMs are stochastic parrots predicting the next word (aka, a partially decent auto complete), there's no reason they can't complete <specific question it can't answer> as "I don't know", as that's a perfectly valid sentence too.

That's why I'm still cautiously optimistic about LLMs somewhere being good enough. I don't know if or when someone will manage to do it, but I'm hopeful.

Damn, did I say something wrong or unpopular to get a downvote?
This is a test of the stochastic parrot detection system; if you are a stochastic parrot, please disregard this comment.
I am too sleepy to understand this comment :(

Do stochastic parrots dream of the number of 'e's in "electric sheep"?

AI models moved beyond next word predictors recently. Considering them to just be partially decent auto complete is completely missing many recent innovations.