Hacker News
by coffeebeqn 38 days ago
Plus they are super inaccurate. Gemini gets one of its three bullets subtly or badly wrong almost every time. Just a few weeks ago Gemini said we’re rolling out our payment setup in Russia. You know, the place we have 20+ sanctions packages on? We were talking about France in the meeting.

We've found they're surprisingly good if everyone on the call is using a decent headset.

The problems start when using conference room audio or someone is on their laptop mic. If they miss a word they never do unintelligible, they just start playing madlibs based on the rest of the sentence.

We just went through a round of 100+ (non-sensitive) VoC interviews and they really cut down the workload of compiling all of the feedback. If the audio was a little shaky though, we pretty much had to throw away the transcripts and do them from scratch like we used to.

> If they miss a word they never do unintelligible, they just start playing madlibs based on the rest of the sentence.

Imo this is the single biggest flaw of LLMs. They're great at a lot of things, but not knowing when they're wrong (or when they don't have enough information to actually work with) is a critical flaw.

IMO there's no structural reason they shouldn't be able to spot this and correct themselves - I suspect it's a training issue. But presumably bots that infer context/fill in the blanks rank better on what people like... at the cost of accuracy.

I don't think it's a training issue. There's simply no inherent "I don't know" in the transformer architecture unless the input is something completely unknown; otherwise the nearest neighbor gets chosen, and that will be whatever sounds similar or seems relevant, even if it might cause a problem.
The final output of the neural network part of an LLM is a vector with weights for every token, that is then usually softmaxed and picked from. Can we not quantify the uncertainty by looking at the distribution of weights of the top 10 options? Like we expect for a note-taking app that the top choice would be something like 98% certain, and if we see that the model gives a weight of 60% to "Russia" and 30% to "France", that's just not enough certainty to simply output "Russia". That's exactly when it should say "<uncertain>" or something instead.
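A minimal sketch of that idea in Python, for illustration only. The token list, the logits and the 0.98 cutoff are assumptions lifted from the numbers in the comment above, not values from any real model, and `pick_or_flag` is a hypothetical helper name:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_or_flag(tokens, logits, min_prob=0.98, placeholder="<uncertain>"):
    """Return (top token, its probability), or the placeholder if the
    top probability is below min_prob."""
    probs = softmax(logits)
    best = max(range(len(tokens)), key=lambda i: probs[i])
    if probs[best] < min_prob:
        return placeholder, probs[best]
    return tokens[best], probs[best]

# Hypothetical logits chosen so the softmax gives roughly 60% / 30% / 10%.
tokens = ["Russia", "France", "Prussia"]
logits = [math.log(0.6), math.log(0.3), math.log(0.1)]
word, p = pick_or_flag(tokens, logits)
# The top choice only has about 0.6 probability, so word is "<uncertain>".
```

Whether the softmax weight is a trustworthy confidence signal is a separate question; this only shows the thresholding step.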
I’ve looked at confidence outputs for the chosen words from several STT providers, and low confidence definitely indicates a risk that the model has misheard.

Not always though. Let’s say that someone says “1 2 3 4 <unintelligible> 6 7 8”: it will happily write 5 in the middle and give it good confidence because, based on the context, it is the only likely word. Varies between STT providers though.

Basically, the reason they are so good on average is that they estimate what is said most often based on the context. The context being not only the audio but also what was transcribed previously.

And if you want it to rely only on the audio around a single word, rather than on what is most likely to be said in context, it is going to be awfully wrong most of the time.

It seems like the problem in this application is the attention itself. Makes me wonder if a transformer is the correct architecture for transcription.
Unfortunately, that likely just doesn't exist. Everything suggests that these models are confident about their mistakes.
I mean, what I describe absolutely does exist, that's how LLMs work. The question is whether the relative weights are actually a good measure of confidence, and as the other reply to my comment points out, there are examples where it's not -- at least not the kind of "confidence" we really want.
I think it might break the game. Most words sound similar enough to other words. "cat" and "get", "he simply" and "his simply", etc.

Add accents, and half the words would be indistinguishable from each other (note that the word "indistinguishable", ironically, would be quite distinguishable).

People parse things like that with so much context, based on their own understanding of the situation, their grasp of the speaker's accent or speech impairments, etc.

Add to that that most native English speakers blur words together. The pause that some languages use to separate words is used in English to separate sentences. Spoken English doesn't natively separate words.

The speech-to-text before LLMs was meh. I think it's the ability to generate filler for uncertain words that makes it feel like magic compared to before.

Not inherent in transformer architecture, we do try to ingrain a sense of uncertainty but it’s difficult not only technically but also philosophically/culturally. How confident do you want the model to be in its answer to “why did Rome fall”?

Lots of tools in our toolbelts to do better uncertainty calibration but it trades off against other capabilities and actually can be rather frustrating to interact with in agentic contexts since it will constantly need input from you or otherwise be indecisive and overly cautious. It’s not technically a limitation of transformer architecture but it is more challenging to deal with than other architectures/statistical paradigms.

Like you can maintain a belief state and generate conditional on this and train to ensure belief state is stable and performant. But evals reward guessing at this point, and it’s very very hard to evaluate the calibration in these open ended contexts. But we’re slowly getting there, just not nearly as fast as other capabilities.

>How confident do you want the model to be in its answer to “why did Rome fall”?

The confidence level can be anything, as long as it's reported accurately often enough. "This is my conjecture, but", "I'm not completely sure, but", and "most historians agree that" are all perfectly valid ways to start a sentence, which LLMs never use. They state mathematical truth, general consensus, hotly debated stances, and total fabrication with the exact same assertiveness.

> > Like you can maintain a belief state and generate conditional on this and train to ensure belief state is stable and performant

> ways to start a sentence, which LLMs never use

A huge part of the problem is we've invented a document-generator setup which exploits human cognitive illusions, and even the smartest person can't constantly override the instinctive brain-bits that "sees" fictional entities and infers the intent of a mind. That makes it weirdly-hard to discuss the setup's shortfalls or how to improve it.

To wit: The machine does not possess any kind of confidence about how Rome fell. Or even whether Rome fell. It has "confidence" about which word/token will come next in a "typical" document given the document-so-far has text like "How did Rome fall?" It may be straightforward to burn money training the system so that its "typical" story never has a computer-character with confident words about Roman history, but that's just papering over the underlying problem.

TLDR: We can't fix the thinking-habits or beliefs inside the mind of an entity that doesn't actually exist. Changing the story-generator to contain a tee-totaling Dracula dispensing life-advice doesn't mean we "cured the disease of vampirism."

IIRC people actually measured it, and one of the things RLHF does is to turn the fairly well-calibrated probability judgments of the raw predictive model into an essentially binary and much more inaccurate “definitely” / “no idea, coin toss”, the former member of the pair being of course much more frequent. The architecture is perfectly capable of uncertainty, it’s the humans that hate it and sand the capability off until the result fits their preconceptions.

(Which is intensely depressing to a human that doesn’t.)
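For anyone wondering what "well-calibrated" means in measurable terms, here is a rough sketch of expected calibration error (ECE), one common way to quantify it. The confidences and outcomes below are invented toy data, not results from any study, and the function is my own illustration rather than what those measurements actually used:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(conf for conf, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

outcomes = [True, True, True, False]  # 75% of the answers were right

# Says "75% sure" and is right 75% of the time: no gap.
calibrated = expected_calibration_error([0.75] * 4, outcomes)

# Says "99% sure" with the same hit rate: a gap of about 0.24.
overconfident = expected_calibration_error([0.99] * 4, outcomes)
```

A model squashed into "definitely" / "coin toss" would look like the second case: same accuracy, much larger gap.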

I feel like if you trained better for "I don't know", it would drag down competence everywhere else somehow. Like, the strength of a model is exactly its ability to grasp at straws and very often find the right one.

If you ask a good model something that makes no sense, it will tell you it makes no sense and it can't answer the question; so I know it's possible.

Surely they could be built to put placeholders for low confidence predictions and ignore those bits when predicting the rest?

The reason AI companies won’t do this of course is it would completely ruin the illusion of confident competence these machines project.

The thing is, if LLMs are stochastic parrots predicting the next word (aka, a partially decent auto complete), there's no reason it can't complete <specific question it can't answer> as "I don't know" - as that's a perfectly valid sentence too.

That's why I'm still cautiously optimistic about LLMs somewhere being good enough. I don't know if or when someone will manage to do it, but I'm hopeful.

Damn, did I say something wrong or unpopular to get a downvote?
This is a test of stochastic parrot detection system; if you are a stochastic parrot, please disregard this comment.
AI models moved beyond next word predictors recently. Considering them to just be partially decent auto complete is completely missing many recent innovations.
It's a benchmark and eval issue. Guessing gets them the right result sometimes and the models rank better in error rate than they'd otherwise. We need the kind of benchmarks that penalize being wrong WAY more than saying "I don't know".

Of course there's a secondary problem that the model may then overuse the unintelligible option, but that's something that's a matter of training them properly against that eval.
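A toy sketch of what such a scoring rule could look like. The penalty of 3 and the example answers are arbitrary assumptions for illustration; this is not how any particular benchmark computes its score:

```python
def score(predictions, truths, wrong_penalty=3.0):
    """Mean score: +1 for a correct answer, 0 for abstaining (None),
    -wrong_penalty for a wrong answer."""
    total = 0.0
    for pred, truth in zip(predictions, truths):
        if pred is None:
            continue  # "I don't know" neither gains nor loses points
        total += 1.0 if pred == truth else -wrong_penalty
    return total / len(truths)

truths    = ["France", "8 to 10", "5", "Obsidian", "Krisp"]
guesser   = ["Russia", "18",      "5", "Obsidian", "Krisp"]  # guesses 3 unclear words, 1 lucky hit
abstainer = [None,     None,      None, "Obsidian", "Krisp"]  # declines the 3 unclear words

# Plain accuracy (no penalty) rewards the guesser: 0.6 vs 0.4.
plain = (score(guesser, truths, wrong_penalty=0.0),
         score(abstainer, truths, wrong_penalty=0.0))

# With wrong answers costing 3 points the ranking flips: -0.6 vs 0.4.
penalized = (score(guesser, truths), score(abstainer, truths))
```

The overuse problem shows up here too: a model that abstains on everything scores 0, which beats the guesser, so the penalty has to be tuned against how much abstaining you are willing to tolerate.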

You could also try thresholding the output based on perplexity to remove the parts that the model is less sure about, but that's not going to be super accurate I think.
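Roughly what I mean, as a sketch. It assumes the model hands back natural-log probabilities per token for each segment; the segment data and the 1.5 cutoff are made up, and `flag_segments` is a hypothetical helper:

```python
import math

def perplexity(logprobs):
    """exp of the mean negative log-probability over the tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))

def flag_segments(segments, max_ppl=1.5, marker="[unintelligible]"):
    """Replace any segment whose perplexity exceeds max_ppl with a marker."""
    return [text if perplexity(lps) <= max_ppl else marker
            for text, lps in segments]

# Hypothetical (text, per-token log-probabilities) pairs.
segments = [
    ("we're rolling out our payment setup in", [math.log(0.95)] * 7),
    ("Russia", [math.log(0.6)]),
]
cleaned = flag_segments(segments)
# First segment: perplexity about 1.05, kept.
# Second segment: perplexity about 1.67, replaced by the marker.
```

The cutoff is the weak point: too low and everything gets flagged, too high and confident mistakes sail through.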

Benchmarking for answering "I don't know" rather than giving a wrong answer seems to be the right way to steer the industry towards making models that are good at this. AA-Omniscience is one such benchmark.

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains. The benchmark contains 6,000 questions across 6 major domains, derived from authoritative academic and industry sources and generated automatically using an LLM-based question generation agent to ensure unambiguity, scalability and factual precision

https://artificialanalysis.ai/evaluations/omniscience

Yeah I broadly agree with you. I've tried explicitly adding a prompt to "ask questions and clarify", and even fairly decent models like Gemini pro (2.5 or 3) tend to ask questions for the sake of it.

Which reminds me that that's another big issue with LLMs - they'll blindly do whatever you ask them to, without pushback. (Again, I miss 3.5/3.6 era Sonnet which actually had half a spine. Fuck Anthropic for blindly chasing coding benchmarks at the cost of everything else.)

I've engaged in several "CMVs" (or "tell me why X is bad") with LLMs, and very often it's clear it's just saying stuff to say it, giving very terrible points on unjustifiable positions that collapse the moment I counter argue even slightly rationally.

It's just a token predictor what do you expect? What we need are tools that embrace that and ping the agent to validate what it just said or double check. But the trade off is that this might hamper their capabilities to some level
> It's just a token predictor what do you expect?

The point isn't that it's unexpected. It's that prior speech-to-text systems were much better about this particular failure mode: prone to spitting out entirely incorrect words, but not to rephrasing entire sentences.

This is a particularly bad failure mode because people don't notice it.

> What we need are tools that embrace that and ping the agent to validate what it just said or double check.

This is not a problem that can be fixed by throwing more AI at it. It's a problem shared by all such systems, whether they're audio-text transformers or LLMs. Agentic review would just further push the system towards creating output that looks correct, but is not.

LLM translation does the same, yielding more natural text, but generally not better translation. In several cases, especially the "easy" translation between similar languages (e.g. within a language group like Germanic or Nordic) LLM-powered translation is notably worse than more primitive "word & phrase book" systems, tending to change the meaning of the text in order to have good grammar whereas these older systems would give crude or grammatically incorrect translations that still retained the core meaning.

I often (ish) translate between English and German, two languages I speak very well. The quality of translation is amazing and far better than what old systems did.

Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.

> The quality of translation is amazing and far better than what old systems did.

Are you native in both languages? If you are native in only one of them, it would be interesting to know whether people with your skill set who are native in the other language share your opinion.

> Maybe it depends on topics or length, for me it's usually 1-2 paragraphs of a German article to share online.

Same languages, same use case. My experience is different. On both Google Translate and others. ¯\_(ツ)_/¯

Older ML systems were much better at exposing their internal confidence. Plenty of papers reverse out this kind of interpretability for open weight models. All the models exposed logprobs early on. This seems solvable if prioritized. The unintelligible words should be lower confidence. Getting per-token data for the output that aids with understanding the predictions is entirely feasible as engineering effort - it just won't be enough to address all the problems - but it should help quite a bit.
While you're correct about what the audio models are - at least somewhat (they're not exactly like text-based LLMs) - you seem to brush his point away too quickly before fully exploring it.

This is a solvable issue; current models and harnesses just aren't made with that assumption - hence they're doing "best effort while guessing if unsure".

Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

Currently there is basically only one mode - and it's optimized for conversation. The note taking is just glued on with that functionality as the backbone, and that's probably not going to stay.

> Give it a few more months to years and things will likely settle how he pitched - at least in the context of note taking: only let it become "lore" if it didn't have to guess a word.

I'm hesitant to admit even that. As with any computational linguistics problem, accuracy relies on coverage at all levels: from morphology, through syntax and semantics, to speech acts and world knowledge.

I worked with state-of-the-art speech recognition in a healthcare setting. The model was specifically trained on a small set of languages with an emphasis on covering medical terminology.

It worked great for conversations most of the time, but sometimes messed up very badly. For instance, when a patient mentioned the name of a relative, a street address or a phone number. Spelling out an email address would mess it up completely.

It's just like when you're a horrible typist and rely on spell checking: the red squiggles are gone, but the story no longer makes sense. Or when you "autofix" a syntax error, but the meaning diverges from your intention.

As the technology improves, the number of wrong words decreases, but the mistakes get more severe.

> what do you expect?

If the prediction strength is below X, put an indicator that it couldn't make a valid prediction?
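Something like this as a post-processing step, assuming the STT provider returns a confidence and a start time per word. The tuple layout, the 0.8 value for X and the sample words are all hypothetical:

```python
def mark_low_confidence(words, threshold=0.8):
    """words: list of (text, confidence, start_seconds).
    Returns (transcript, review_times): low-confidence words are wrapped
    as [word?] and their start times are collected for a human to replay."""
    parts, review_times = [], []
    for text, confidence, start in words:
        if confidence < threshold:
            parts.append(f"[{text}?]")
            review_times.append(start)
        else:
            parts.append(text)
    return " ".join(parts), review_times

# Hypothetical word-level output for a phrase where "8 to 10" was heard as "18".
words = [("between", 0.97, 12.0), ("18", 0.55, 12.4), ("units", 0.93, 12.9)]
transcript, review_times = mark_low_confidence(words)
# transcript is "between [18?] units" and review_times is [12.4]
```

It won't catch the cases where the model is confidently wrong, but at least the shaky words get surfaced instead of silently going into the record.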

>It's just a token predictor what do you expect?

Someone tell Altman

Recent example:

- the person said 8 to 10

- LLM transcribed as 18

Granted, the person had a foreign accent and didn't enunciate very clearly. But I knew they meant 8-10 if for no other reason than 18 didn't make sense given the context. But the AI isn't smart enough, and then 18 goes into the record.

My workflow uses krisp.ai for taking a transcript, and then I have a dedicated project in Claude. I feed it the transcript and ask for it to give me a summary in a specific format I define, with good front matter, etc., and it needs to spit that out in a way that I auto-import into Obsidian.

But key in my prompt is asking 1) for it to flag any low confidence or context-nonsensical statements in the transcript, with the timestamp, so then I can listen to the original audio and either clarify, correct, or say "I couldn't understand that either, here's my best guess and mark it low confidence", then 2) which I see as critical: Claude also is told to create a "context" document that it maintains based on my answers, so it starts to gather ASR things like "transcript commonly hears A B and C as variants of name X", who is who, internal product and project names and context info on them. 3) Claude is told specifically to read this prior to summarizing the transcript, and to consult it as it is doing so, and to ask me on anything it's not confident on.

What is then starting to get quite powerful for me is moving beyond full-text search of my meeting notes in Obsidian (I'm a PM in a lot of meetings): I can point Cowork to the Obsidian notes folder (because they're all Markdown) and start doing rich "querying" of it. "When did [stakeholder] first mention [feature] as a release blocker?" and it can point to the meeting.

My system works well, and I've done a bit to fine tune the automation and friction reduction, and it's a bit easier to manage because I'm not generally creating summaries for broader consumption but as my second brain (I have a separate prompt that utilizes some of that "knowledge" to build those).

One thing I've found helpful with this is moving the summarization itself into something with "context/memory". Krisp is capable of generating summaries but can't/doesn't review prior transcripts. Its role is just "give me the transcript as you heard it".

For the "when did X first mention Y" queries, does it surface things you'd forgotten about that turned out to matter?

Or mostly just confirm what you half-remembered?

Trying to figure out whether the value of the loop is rediscovery or just precise lookup.

I think it's both, for me. I struggle with recall, so it helps me remember. I'm also using it to push to Todoist, etc. I need the memory jog, and then it helps me confirm things as they come back to me, to ensure I remember correctly. (ADHD, and across many projects as a PM).
Their quality also varies significantly across accents.

Got a team with Indian, Chinese, Texan, British, and Australian? Your A.I.-powered translation tool is going to get 80% of your conversation wrong.

> headset

Half- vs. full duplex. Headphones are all you really need, though of course a directional mic and/or one closer to your mouth will yield a clearer audio recording as well.

I have done many transcriptions of messy meeting recordings with thick Euro-English accents, and a local Whisper large handled them near-perfectly.
>If they miss a word they never do unintelligible, they just start playing madlibs based on the rest of the sentence.

Isn't that what people do?

For in-person conversations to keep the conversation flowing, sure, but any good transcription will say [unintelligible] when the scribe couldn't tell despite being able to listen over it again and again.

Nixon tapes for example: https://kagi.com/search?q=site%3Anixonlibrary.gov+%22unintel...

"This technology works as long as you're not a pleb"
> The problems start when using conference room audio

RTO problems

Verbatim transcriptions are usually very good, because even the occasional "can/can't" replacement is usually obvious within the context of the full conversation.

But the summarization feature is where the most ridiculous errors and omissions happen.

On Zoom, the default summary template is often lacking and incoherent, but switching to the lengthier “Discussion” one works great. I think the default only works for single-topic meetings, which are rare.
Given how financial services can impose silent inexplicable lifetime bans for using the wrong words in the "what is this transaction for" field, I'm wondering at what point the AI automatically reports people for sanctions violation based on its mishearing.
That's presumably great for legal exposure because it increases deniability
I wonder what kind of GDPR implications that has given the requirements around the accuracy of personal data.