by fouc 1608 days ago
I would imagine GPT-3 or similar would be able to replace the garbled 1 out of 20 words with something that actually makes sense in context.
3 comments

Yes, sort of. Thing is, many modern speech models actually learn an internal language model, so we're already kind of doing that. In languages and domains where massive amounts of training data are available (say, grammatically correct English), this internal language understanding is so good you don't need the external model[1].

On the other hand, throwing an additional language model like GPT or BERT into the mix can help if you don't have a ton of voice data. In my attempt to do this, a large portion of the improvement came from letting the language model read the previous sentences in the conversation[2]. AFAIK most commercial systems are blissfully unaware of your previous sentences, leading to conversations like "set an alarm"/"sure when?"/"eightam"/"your nearest ATM is...".
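
To make the "read the previous sentences" idea concrete, here is a minimal sketch of n-best rescoring with conversation context. It is not the method from [2]: `rescore`, `toy_lm_score`, and all the scores are made up for illustration, and a real system would plug in an actual language model where the toy scorer sits.

```python
from typing import Callable, List, Tuple

# (candidate text, acoustic log-score from the speech model); values are invented
Hypothesis = Tuple[str, float]


def rescore(
    hypotheses: List[Hypothesis],
    context: List[str],
    lm_score: Callable[[List[str], str], float],
    lm_weight: float = 1.0,
) -> str:
    """Return the candidate with the best combined acoustic + LM score."""
    def combined(hypothesis: Hypothesis) -> float:
        text, acoustic = hypothesis
        return acoustic + lm_weight * lm_score(context, text)

    return max(hypotheses, key=combined)[0]


def toy_lm_score(context: List[str], candidate: str) -> float:
    """Hypothetical stand-in for a real LM: after an alarm prompt,
    reward candidates that contain a time-of-day token."""
    previous = " ".join(context).lower()
    tokens = candidate.lower().split()
    if "alarm" in previous and any(tok in tokens for tok in ("am", "pm")):
        return 2.0
    return 0.0


nbest = [("atm", -1.0), ("eight am", -1.5)]
conversation = ["set an alarm", "sure when?"]

print(rescore(nbest, conversation, toy_lm_score))  # scorer sees the conversation
print(rescore(nbest, [], toy_lm_score))            # scorer sees no context
```

With the conversation passed in, the time expression wins; with an empty context, the candidate that scored higher acoustically wins, which is the "nearest ATM" situation.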

A word of caution though: letting BERT/GPT edit the outputs also introduces a (potentially) much more dangerous failure mode: if the speech signal is difficult to understand, the errors in the resulting transcript will be difficult for humans to identify as transcription failures.

For example, "yeah, I dunno I haven't..." (read on a noisy phone line in an obscure dialect) was transcribed as "yeah yeah not that is I I am then" by the baseline speech system. After we let BERT edit the outputs, the transcript became "yeah that's not what I was saying...". Which, ironically, was definitely not what the person was saying.
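
One possible mitigation, sketched here as an assumption and not something we evaluated: keep the baseline transcript around and mark every word the language model changed, so a reader can at least see where the edit happened. `mark_edits` is a hypothetical helper; it only uses `difflib` from the standard library.

```python
import difflib


def mark_edits(raw: str, edited: str) -> str:
    """Return the edited transcript with LM changes made visible.

    Words the LM inserted or substituted are wrapped in [brackets];
    words it deleted from the raw transcript are shown as [-deleted-].
    """
    raw_words, edited_words = raw.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=raw_words, b=edited_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(edited_words[j1:j2])
        elif tag == "delete":
            out.append("[-" + " ".join(raw_words[i1:i2]) + "-]")
        else:  # "replace" or "insert"
            out.append("[" + " ".join(edited_words[j1:j2]) + "]")
    return " ".join(out)


print(mark_edits("set an alarm for atm", "set an alarm for eight am"))
```

A transcript that comes back mostly bracketed is a hint that the language model was guessing more than transcribing.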

[1] https://arxiv.org/abs/1911.08460, page 9

[2] https://arxiv.org/abs/2110.02267

edit: clarify why previous sentences matter

That seems worse to me. If there's going to be a transcription error I'd prefer it to be obvious instead of just changing the meaning of the sentence.
How do you know what word is garbled?
Grammar and context. It'd be closer to dictation than current speech-to-text, with GPT serving as a "brain" interpreting what you mean in the current context instead of raw input. You could tie in the "natural language to [SQL, bash, log parsing, regex]" capabilities of GPT-3 and so on.

Obviously it wouldn't be as good as a real person, but it'd be a nice leap from the 80%ish accuracy of high-performing commercial STT systems to the 95%+ level.
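
A toy illustration of what "grammar and context" could mean in practice, assuming a tiny bigram table in place of GPT: flag any word that never followed its predecessor in some reference text. `train_bigrams`, `flag_suspects`, and the three-sentence corpus are hypothetical, and a real model would score likelihood rather than check for unseen pairs.

```python
from collections import Counter
from typing import List


def train_bigrams(corpus: List[str]) -> Counter:
    """Count adjacent word pairs, with <s> marking the sentence start."""
    counts: Counter = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def flag_suspects(transcript: str, bigrams: Counter, min_count: int = 1) -> List[str]:
    """Return words whose pairing with the previous word was seen
    fewer than min_count times in the reference corpus."""
    words = ["<s>"] + transcript.lower().split()
    return [cur for prev, cur in zip(words, words[1:]) if bigrams[(prev, cur)] < min_count]


corpus = [
    "set an alarm for eight am",
    "set an alarm for noon",
    "find the nearest atm",
]
bigrams = train_bigrams(corpus)

print(flag_suspects("set an alarm for atm", bigrams))
```

Here "atm" is a perfectly good word on its own, but it gets flagged because it never follows "for" in the reference sentences.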

...and how do you know which word you meant (even if it's not garbled)?

The number of homonyms (and near-homonyms) in English is huge.

It's been a major issue for some users of W3W (e.g. https://cybergibbons.com/security-2/why-what3words-is-not-su...)