by PaulHoule 1608 days ago
Google and Siri are good at what they do. They aren't good at other things, such as dictation.

As I see it, the big problem in voice interaction is that a human being will ask questions to clarify what you said if they don't understand, and current systems don't even try. (Actually the search paradigm lets you do some refinement; "Ok Google" works amazingly well on Android TV.)

Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.
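To put rough numbers on that, here is a minimal sketch. It assumes each word is garbled independently with probability 1/20, which is a simplification (real errors tend to cluster). The function name and the sentence lengths are made up for illustration.

```python
def sentence_error_rate(word_error_rate: float, words_per_sentence: int) -> float:
    """Probability that a sentence contains at least one garbled word.

    Assumes every word is garbled independently with the same probability,
    which is a simplifying assumption, not a property of any real system.
    """
    return 1.0 - (1.0 - word_error_rate) ** words_per_sentence


if __name__ == "__main__":
    # 1 garbled word out of 20 corresponds to a 5% word error rate.
    for n in (5, 10, 14, 20):
        print(f"{n:2d} words: {sentence_error_rate(0.05, n):.2f}")
```

Under that assumption, a 14-word sentence already has roughly even odds of containing at least one garbled word, which is about where "every other sentence" comes from.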

3 comments

> Google and Siri are good at what they do. They aren't good at other things, such as dictation.

It's interesting you mention that Google isn't good at dictation, as I've found it excellent on the Pixel 6 (maybe the quality varies depending on what hardware you're running on?). If I need to write out anything over a sentence or two on my phone, I'll almost always dictate it, and as long as I have a reasonable idea of what I want to say beforehand it works well.

What I personally find a little jarring is that I need to compose what I want in my head further in advance than I would if I'm typing, as correcting mistakes is more awkward.

Whenever I use the Google Assistant, I'm shocked by a) how good the speech-to-text is at figuring out my words, and b) how bad the application layer is at using those words.

I tend to over-enunciate, so I don't get many bad bugs in the parsing... but that doesn't stop the Google Assistant from delivering completely the wrong response to the words that it's showing me it has correctly recognized, or simply spinning endlessly and locking up my phone.

As an industry, we suck at everything. We've solved the hard problem but failed the easy part of "once the command has been parsed, either execute the action or show the user an error and then close the dialog".
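For what it's worth, the "easy part" could look something like the sketch below. Everything in it (`Outcome`, `handle_command`, the `actions` table) is hypothetical and has nothing to do with how the Google Assistant is actually built. It only illustrates the contract: every parsed command ends in either a result or a visible error, and the dialog closes either way. It does not deal with actions that block forever; that would need a timeout on top.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Outcome:
    ok: bool
    message: str
    dialog_open: bool = False  # every outcome leaves the dialog closed


def handle_command(text: str, actions: Dict[str, Callable[[], str]]) -> Outcome:
    """Run the action for a recognized command, or report an error."""
    action = actions.get(text.strip().lower())
    if action is None:
        # Unknown command: tell the user instead of guessing.
        return Outcome(False, f"Sorry, I can't do that: {text!r}")
    try:
        return Outcome(True, action())
    except Exception as exc:
        # The action itself failed: surface the failure to the user.
        return Outcome(False, f"Action failed: {exc}")


if __name__ == "__main__":
    actions = {"set an alarm": lambda: "Alarm set"}
    print(handle_command("Set an alarm", actions))
    print(handle_command("fly me to the moon", actions))
```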

The only thing I find really awful about speech-to-text on Google is that it can't seem to detect punctuation.

When "OK Google" first came out, I was so wowed and I was constantly going "OK Google, search whatever". Now I use the button to trigger it because it doesn't hear me, and I have to retry a lot of queries -- it just doesn't work as well. Perhaps they made it work great for white males at first but then had to accept a bunch of tradeoffs to get it working for everyone.
I would imagine GPT-3 or similar would be able to replace the garbled 1 out of 20 words with something that actually makes sense in context.
Yes, sort of. Thing is, many modern speech models actually learn an internal language model, so we're already kind of doing that. In languages and domains where massive amounts of training data are available (say, grammatically correct English), this internal language understanding is so good you don't need the external model[1].

On the other hand, throwing an additional language model like GPT and BERT into the mix can help if you don't have a ton of voice data. In my attempt to do this, a large portion of the improvement came from letting the language model read the previous sentences in the conversation[2]. AFAIK most commercial systems are blissfully unaware of your previous sentences, leading to conversations like "set an alarm"/"sure when?"/"eightam"/"your nearest ATM is...".
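A toy sketch of the idea, to be clear not the method from [2]: take the recognizer's n-best list and add a bonus to hypotheses that fit the previous turns. The regex and the "when" check are crude stand-ins for a real language model, and all the names and scores here are invented for illustration.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical stand-in for a language model: does the text look like a clock time?
TIME_PATTERN = re.compile(
    r"\b(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|\d{1,2})"
    r"\s*(?:am|pm)\b"
)


def context_bonus(hypothesis: str, previous_turns: Sequence[str]) -> float:
    """Return 1.0 if a previous turn asked 'when' and the hypothesis is a time."""
    asked_for_time = any("when" in turn.lower() for turn in previous_turns)
    looks_like_time = bool(TIME_PATTERN.search(hypothesis.lower()))
    return 1.0 if asked_for_time and looks_like_time else 0.0


def rescore(
    nbest: List[Tuple[str, float]],
    previous_turns: Sequence[str],
    weight: float = 2.0,
) -> str:
    """Pick the hypothesis with the best acoustic score plus weighted context bonus."""
    best = max(nbest, key=lambda pair: pair[1] + weight * context_bonus(pair[0], previous_turns))
    return best[0]


if __name__ == "__main__":
    # Made-up acoustic scores (higher is better) for the audio "eightam".
    nbest = [("atm", -1.0), ("eight am", -1.5)]
    print(rescore(nbest, []))                               # atm
    print(rescore(nbest, ["set an alarm", "sure, when?"]))  # eight am
```

Without the conversation history the acoustically better "atm" wins; with it, "eight am" does.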

A word of caution though: letting BERT/GPT edit the outputs also gives a (potentially) much more dangerous failure mode: if the speech signal is difficult to understand, the resulting transcript will read fluently, so it will be difficult for humans to identify it as a transcription failure.

For example, "yeah, I dunno I haven't..." (read on a noisy phone line in an obscure dialect) was transcribed as "yeah yeah not that is I I am then" by the baseline speech system. After we let BERT edit the outputs, the transcript became "yeah that's not what I was saying...". Which, ironically, was definitely not what the person was saying.

[1] https://arxiv.org/abs/1911.08460, page 9

[2] https://arxiv.org/abs/2110.02267

edit: clarify why previous sentences matter

That seems worse to me. If there's going to be a transcription error I'd prefer it to be obvious instead of just changing the meaning of the sentence.
How do you know what word is garbled?
Grammar and context. It'd be closer to dictation than current speech-to-text, with GPT serving as a "brain" interpreting what you mean in the current context instead of raw input. You could tie in the "natural language to [SQL, bash, log parse, regex]" capabilities of GPT-3 and so on.

Obviously it wouldn't be as good as a real person, but it'd be a nice leap to the 95%+ level of accuracy over the 80%ish on high-performing commercial STT systems.
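For reference, accuracy figures like these are usually reported via word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. I'm assuming that is the measure meant here. A minimal sketch, with made-up example sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("set an alarm for eight am", "set an alarm for atm")
    print(f"WER: {wer:.2f}, word accuracy: {1 - wer:.2f}")
```

Note that WER counts every error equally, so it says nothing about whether the meaning of the sentence survived.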

...and how do you know which word you meant (even if it's not garbled)?

The number of homonyms (and near-homonyms) in English is huge.

It's been a major issue for some users of W3W (e.g. https://cybergibbons.com/security-2/why-what3words-is-not-su...)