by user_7832 41 days ago
> If they miss a word they never do unintelligible, they just start playing madlibs based on the rest of the sentence.

Imo this is the single biggest flaw of LLMs. They're great at a lot of things, but not knowing when they're wrong (or when they don't have enough information to actually work with) is a critical flaw.

IMO there's nothing structural about why they shouldn't be able to spot this and correct themselves - I suspect it's a training issue. But presumably bots that infer context/fill in the dots rank better on what people like... at the cost of accuracy.

I don't think it's a training issue. There's simply no inherent "I don't know" in the transformer architecture unless the input is something completely unknown. Otherwise the nearest neighbor will be chosen, and that will be whatever sounds similar or seems relevant, even if it might cause a problem.
The final output of the neural network part of an LLM is a vector with weights for every token, that is then usually softmaxed and picked from. Can we not quantify the uncertainty by looking at the distribution of weights of the top 10 options? Like we expect for a note-taking app that the top choice would be something like 98% certain, and if we see that the model gives a weight of 60% to "Russia" and 30% to "France", that's just not enough certainty to simply output "Russia". That's exactly when it should say "<uncertain>" or something instead.
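
A minimal sketch of that idea, assuming we can read the raw per-token scores. The token names, the logits, and the 98% threshold are made up for illustration, and `pick_or_flag` is a hypothetical helper, not any provider's API:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_or_flag(logits, threshold=0.98, placeholder="<uncertain>"):
    """Return the top token, or a placeholder if its probability is below threshold."""
    probs = softmax(logits)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else placeholder

# Hypothetical logits chosen so that softmax gives roughly 60% / 30% / 10%.
ambiguous = {"Russia": math.log(0.6), "France": math.log(0.3), "Prussia": math.log(0.1)}
# Hypothetical logits where one option clearly dominates (about 99%).
clear = {"Russia": math.log(0.99), "France": math.log(0.005), "Prussia": math.log(0.005)}

print(pick_or_flag(ambiguous))  # <uncertain>
print(pick_or_flag(clear))      # Russia
```

Whether the softmax weight is a trustworthy confidence signal is a separate question; this only shows the mechanics of thresholding it.
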
I’ve looked at confidence outputs for the chosen words from several STT providers, and low confidence definitely indicates a risk that the model has misheard.

Not always though. Let’s say that someone is saying ”1 2 3 4 <unintelligible> 6 7 8”. It will happily write 5 in the middle and give it good confidence because, based on the context, it is the only likely word. Varies between STT providers though.

Basically, the reason they are so good on average is that they estimate what is said most often based on the context. The context is not only the audio but also what was transcribed previously.

And if you don’t want it to be based on what is most likely to be said in context and only based on the audio around 1 word it is going to be awfully wrong most of the time.
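
A toy illustration of the counting example, assuming the final confidence behaves like Bayes' rule over an audio score and a context score. That is a simplification for the sake of argument, not how any particular STT provider computes its numbers, and all probabilities here are made up:

```python
def posterior(acoustic_likelihood, context_prior):
    """Combine P(audio | word) with P(word | context) via Bayes' rule."""
    joint = {w: acoustic_likelihood[w] * context_prior[w] for w in context_prior}
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

words = ["four", "five", "nine"]

# The audio is unintelligible: every candidate explains it equally well.
flat_audio = {w: 1 / len(words) for w in words}

# After "1 2 3 4 ... 6 7 8" the context strongly expects "five" (made-up numbers).
counting_context = {"four": 0.01, "five": 0.98, "nine": 0.01}

# With no useful context, the prior is flat too.
no_context = {w: 1 / len(words) for w in words}

with_context = posterior(flat_audio, counting_context)
without_context = posterior(flat_audio, no_context)

print(round(with_context["five"], 2))     # 0.98
print(round(without_context["five"], 2))  # 0.33
```

With useless audio the context alone produces 98% "confidence" in a word nobody heard, and without the context the same audio gives a coin toss between three options.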

It seems like the problem in this application is attention itself. Makes me wonder if using a transformer for transcription is the correct architecture.
Unfortunately, that likely just doesn't exist. Everything suggests that these models are confident about their mistakes.
I mean, what I describe absolutely does exist, that's how LLMs work. The question is whether the relative weights are actually a good measure of confidence, and as the other reply to my comment points out, there are examples where it's not -- at least not the kind of "confidence" we really want.
I think it might break the game. Most words sound similar enough to other words. "cat" and "get", "he simply" and "his simply", etc.

Add accents, and half the words would be indistinguishable from each other (note that the word "indistinguishable", ironically, would be quite distinguishable).

People parse things like that in so much context, based on their own understanding of a situation, their grasp of the speaker's accent or speech impairments, etc.

Add to that that most native English speakers blur words together. The pause that in some languages is used to separate words is used in English to separate sentences. English as spoken doesn't separate words natively.

The speech-to-text before LLMs was meh. I think it's the ability to generate filler for uncertain words that makes it feel magic compared to before.

Not inherent in transformer architecture, we do try to ingrain a sense of uncertainty but it’s difficult not only technically but also philosophically/culturally. How confident do you want the model to be in its answer to “why did Rome fall”?

Lots of tools in our toolbelts to do better uncertainty calibration but it trades off against other capabilities and actually can be rather frustrating to interact with in agentic contexts since it will constantly need input from you or otherwise be indecisive and overly cautious. It’s not technically a limitation of transformer architecture but it is more challenging to deal with than other architectures/statistical paradigms.

Like you can maintain a belief state and generate conditional on this and train to ensure belief state is stable and performant. But evals reward guessing at this point, and it’s very very hard to evaluate the calibration in these open ended contexts. But we’re slowly getting there, just not nearly as fast as other capabilities.

>How confident do you want the model to be in its answer to “why did Rome fall”?

The confidence level can be any, as long as it's reported accurately often enough. "This is my conjecture, but", "I'm not completely sure, but", and "most historians agree that" are all perfectly valid ways to start a sentence, which LLMs never use. They state mathematical truth, general consensus, hotly debated stances, and total fabrication, with the exact same assertiveness.

> > Like you can maintain a belief state and generate conditional on this and train to ensure belief state is stable and performant

> ways to start a sentence, which LLMs never use

A huge part of the problem is we've invented a document-generator setup which exploits human cognitive illusions, and even the smartest person can't constantly override the instinctive brain-bits that "see" fictional entities and infer the intent of a mind. That makes it weirdly hard to discuss the setup's shortfalls or how to improve it.

To wit: The machine does not possess any kind of confidence about how Rome fell. Or even whether Rome fell. It has "confidence" about which word/token will come next in a "typical" document given the document-so-far has text like "How did Rome fall?" It may be straightforward to burn money training the system so that its "typical" story never has a computer-character with confident words about Roman history, but that's just papering over the underlying problem.

TLDR: We can't fix the thinking-habits or beliefs inside the mind of an entity that doesn't actually exist. Changing the story-generator to contain a tee-totaling Dracula dispensing life-advice doesn't mean we "cured the disease of vampirism."

IIRC people actually measured it, and one of the things RLHF does is to turn the fairly well-calibrated probability judgments of the raw predictive model into an essentially binary and much more inaccurate “definitely” / “no idea, coin toss”, the former member of the pair being of course much more frequent. The architecture is perfectly capable of uncertainty, it’s the humans that hate it and sand the capability off until the result fits their preconceptions.

(Which is intensely depressing to a human that doesn’t.)
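
For anyone who wants to see what "well-calibrated" means in numbers, here is a rough sketch of expected calibration error (ECE), one common way to measure it. The data is made up and does not reproduce any published measurement; it only contrasts a model that says "70%" and is right 70% of the time with one that says "definitely" every time:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up data: both models get 7 of 10 answers right.
outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
calibrated = [0.7] * 10      # says "70% sure" and is right 70% of the time
overconfident = [0.99] * 10  # says "definitely" every time

print(round(expected_calibration_error(calibrated, outcomes), 2))     # 0.0
print(round(expected_calibration_error(overconfident, outcomes), 2))  # 0.29
```

Same accuracy, very different calibration: the second model's stated confidence is off by 29 points.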

I feel like if you trained better for "I don't know", it would drag down competence everywhere else somehow. Like, the strength of a model is exactly its ability to grasp at straws and very often find the right one.

If you ask a good model something that makes no sense, it will tell you it makes no sense and it can't answer the question; so I know it's possible.

Surely they could be built to put placeholders for low confidence predictions and ignore those bits when predicting the rest?

The reason AI companies won’t do this, of course, is that it would completely ruin the illusion of confidence these machines project.

The thing is, if LLMs are stochastic parrots predicting the next word (aka, a partially decent auto complete), there's no reason it can't complete <specific question it can't answer> as "I don't know" - as that's a perfectly valid sentence too.

That's why I'm still cautiously optimistic about LLMs somewhere being good enough. I don't know if or when someone will manage to do it, but I'm hopeful.

Damn, did I say something wrong or unpopular to get a downvote?
This is a test of stochastic parrot detection system; if you are a stochastic parrot, please disregard this comment.
I am too sleepy to understand this comment :(

Do stochastic parrots dream of the number of 'e's in "electric sheep"?

AI models moved beyond next word predictors recently. Considering them to just be partially decent auto complete is completely missing many recent innovations.
It's a benchmark and eval issue. Guessing gets them the right result sometimes, so the models rank better on error rate than they otherwise would. We need the kind of benchmarks that penalize being wrong WAY more than saying "I don't know".

Of course there's a secondary problem that the model may then overuse the unintelligible option, but that's something that's a matter of training them properly against that eval.
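
To make the incentive concrete, here is a sketch of such a scoring rule. The +1 / 0 / -3 values are assumptions picked for illustration, not the scoring of any real benchmark:

```python
def score(answer, truth, wrong_penalty=3.0):
    """+1 for a correct answer, 0 for abstaining, -wrong_penalty for a wrong one."""
    if answer is None:  # None stands for "I don't know"
        return 0.0
    return 1.0 if answer == truth else -wrong_penalty

def expected_guess_score(p_correct, wrong_penalty=3.0):
    """Expected score of guessing when the guess is right with probability p_correct."""
    return p_correct * 1.0 - (1 - p_correct) * wrong_penalty

def should_guess(p_correct, wrong_penalty=3.0):
    """Guessing beats abstaining (score 0) only above penalty / (1 + penalty)."""
    return expected_guess_score(p_correct, wrong_penalty) > 0

print(should_guess(0.6))                     # False: 0.6 - 0.4 * 3 = -0.6
print(should_guess(0.9))                     # True: 0.9 - 0.1 * 3 = 0.6
print(should_guess(0.6, wrong_penalty=0.0))  # True: with no penalty, always guess
```

With a penalty of 3 the break-even point is 75% confidence. With no penalty, which is how plain accuracy works, guessing always pays, and that is the behavior current evals reward.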

You could also try thresholding the output based on perplexity to remove the parts that the model is less sure about, but that's not going to be super accurate I think.
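
Roughly what I mean, assuming the model exposes per-token log-probabilities. The segments, the numbers, and the 1.5 cutoff are all hypothetical:

```python
import math

def perplexity(logprobs):
    """exp of the negative mean log-probability; 1.0 means fully certain."""
    return math.exp(-sum(logprobs) / len(logprobs))

def redact_uncertain(segments, max_perplexity=1.5, placeholder="[unintelligible]"):
    """Replace each segment whose token perplexity exceeds the threshold."""
    kept = []
    for text, logprobs in segments:
        kept.append(text if perplexity(logprobs) <= max_perplexity else placeholder)
    return " ".join(kept)

# Hypothetical (text, per-token log-probability) pairs from a transcription model.
segments = [
    ("the meeting is", [math.log(0.95), math.log(0.97), math.log(0.99)]),
    ("on Thursday", [math.log(0.90), math.log(0.35)]),
    ("at noon", [math.log(0.96), math.log(0.94)]),
]

print(redact_uncertain(segments))
# the meeting is [unintelligible] at noon
```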

Benchmarking for giving I don't know rather than wrong answer seems to be the right way to steer industry towards making models that are good at this. AA-Omniscience is one such benchmark.

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains. The benchmark contains 6,000 questions across 6 major domains, derived from authoritative academic and industry sources and generated automatically using an LLM-based question generation agent to ensure unambiguity, scalability and factual precision

https://artificialanalysis.ai/evaluations/omniscience

Yeah I broadly agree with you. I've tried explicitly adding a prompt to "ask questions and clarify", and even fairly decent models like Gemini Pro (2.5 or 3) tend to ask questions for the sake of it.

Which reminds me that that's another big issue with LLMs - they'll blindly do whatever you ask them to, without pushback. (Again, I miss 3.5/3.6 era Sonnet which actually had half a spine. Fuck anthropic for blindly chasing coding benchmarks at the cost of everything else.)

I've engaged in several "CMVs" (or "tell me why X is bad") with LLMs, and very often it's clear it's just saying stuff to say it, giving very terrible points on unjustifiable positions that collapse the moment I counter argue even slightly rationally.

It's just a token predictor what do you expect? What we need are tools that embrace that and ping the agent to validate what it just said or double check. But the trade off is that this might hamper their capabilities to some level
> It's just a token predictor what do you expect?

The point isn't that it's unexpected. It's that prior speech-to-text systems were much better about this particular failure mode, prone to spitting out entirely incorrect words but not rephrasing entire sentences.

This is a particularly bad failure mode because people don't notice it.

> What we need are tools that embrace that and ping the agent to validate what it just said or double check.

This is not a problem that can be fixed by throwing more AI at it. It's a problem shared by all such systems, whether they're audio-text transformers or LLMs. Agentic review would just further push the system towards creating output that looks correct, but is not.

LLM translation does the same, yielding more natural text, but generally not better translation. In several cases, especially the "easy" translation between similar languages (e.g. within a language group like Germanic or Nordic) LLM-powered translation is notably worse than more primitive "word & phrase book" systems, tending to change the meaning of the text in order to have good grammar whereas these older systems would give crude or grammatically incorrect translations that still retained the core meaning.

I often (ish) translate between English and German, two languages I speak very well. The quality of translation is amazing and far better than what old systems did.

Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.

> The quality of translation is amazing and far better than what old systems did.

Are you native in both languages? If you are only native in one of them, it would be insightful to find out whether people with your skill set, but native in the other language, have the same opinion as you.

It’s rather unlikely that the translation in one direction is great, but lacking in the other, while also being just good enough (compared to before) that my close-to-native English skill misses it, while the old google translate somehow magically made me think it was bad.

Sadly there are no examples here to compare.

> Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.

Same languages, same use case. My experience is different. On both google translate and others. ¯\_(ツ)_/¯

Haven’t used google translate in a long time, mostly because of quality issues before LLMs. Deepl was leading for a while, nowadays I’m very happy with Kagi translate.
Older ML systems were much better at exposing their internal confidence. Plenty of papers reverse out this kind of interpretability for open weight models. All the models exposed logprobs early on. This seems solvable if prioritized. The unintelligible words should be lower confidence. Getting per-token data for the output that aids with understanding the predictions is entirely feasible as engineering effort - it just won't be enough to address all the problems - but it should help quite a bit.
While you're correct about what the audio models are, at least somewhat (they're not exactly like text-based LLMs), you seem to brush his point away too quickly before fully exploring it.

This is a solvable issue. Current models and harnesses just aren't made with that assumption, hence they're doing "best effort while guessing if unsure".

Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

Currently there is basically only one mode - and it's optimized for conversation. The note taking is just glued on with that functionality as the backbone, and that's probably not going to stay.

> Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

I'm hesitant to admit even that. Like any computational linguistics problem, accuracy relies on coverage of all levels: from morphology, through syntax and semantics, to speech acts and world knowledge.

I worked with state-of-the-art speech recognition in a healthcare setting. The model was specifically trained on a small set of languages with emphasis on covering medical terminology.

It worked great for conversations most of the time, but sometimes messed up very badly. For instance when a patient would mention the name of a relative, a street address or a phone number. Spelling out an email address would mess it up completely.

It's just like when you're a horrible typist and rely on spell checking: The red squiggles are gone, but the story no longer makes sense. Or when you "autofix" a syntax error, but the meaning diverges from your intention.

As the technology improves, the number of wrong words decreases, but the mistakes get more severe.

> what do you expect?

If the prediction strength is below X, put an indicator that it couldn't make a valid prediction?

>It's just a token predictor what do you expect?

Someone tell Altman