|
Yes, sort of. Thing is, many modern speech models actually learn an internal language model, so we're already kind of doing that. In languages and domains where massive amounts of training data are available (say, grammatically correct English), this internal language understanding is so good that you don't need the external model[1]. On the other hand, throwing an additional language model like GPT or BERT into the mix can help if you don't have a ton of voice data. In my attempt to do this, a large portion of the improvement came from letting the language model read the previous sentences in the conversation[2]. AFAIK most commercial systems are blissfully unaware of your previous sentences, leading to conversations like "set an alarm"/"sure, when?"/"eight am"/"your nearest ATM is...". A word of caution though: letting BERT/GPT edit the outputs also introduces a (potentially) much more dangerous failure mode. If the speech signal is difficult to understand, the resulting transcript can still read fluently, which makes it difficult for humans to identify as a transcription failure. For example, "yeah, I dunno I haven't..." (read on a noisy phone line in an obscure dialect) was transcribed as "yeah yeah not that is I I am then" by the baseline speech system. After we let BERT edit the outputs, the transcript became "yeah that's not what I was saying...". Which, ironically, was definitely not what the person was saying. [1] https://arxiv.org/abs/1911.08460, page 9 [2] https://arxiv.org/abs/2110.02267 edit: clarify why previous sentences matter |
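To make the "let the language model read the previous sentences" idea concrete, here is a rough sketch of n-best rescoring with conversation context. It is not necessarily what [2] does: the names `rescore` and `toy_lm_log_prob`, the scores, and the weighting are all made up for illustration, and the toy scorer is only a stand-in for a real model like GPT or BERT.

```python
from typing import Callable, List, Tuple

# (transcript, acoustic log-probability) as produced by a speech recognizer.
Hypothesis = Tuple[str, float]


def rescore(
    hypotheses: List[Hypothesis],
    lm_log_prob: Callable[[List[str], str], float],
    context: List[str],
    lm_weight: float = 1.0,
) -> str:
    """Pick the transcript with the best acoustic + weighted LM score."""
    def combined(hyp: Hypothesis) -> float:
        text, acoustic = hyp
        return acoustic + lm_weight * lm_log_prob(context, text)

    return max(hypotheses, key=combined)[0]


def toy_lm_log_prob(context: List[str], text: str) -> float:
    """Hypothetical stand-in for a real LM: prefers a time expression
    when the previous turn asked "when"."""
    time_words = {"am", "pm", "noon", "midnight"}
    asked_when = bool(context) and "when" in context[-1].lower()
    mentions_time = any(word in time_words for word in text.lower().split())
    return -0.5 if (asked_when and mentions_time) else -3.0


# Invented n-best list: the recognizer slightly prefers "atm" acoustically.
n_best = [("atm", -1.0), ("eight am", -1.2)]

with_context = rescore(n_best, toy_lm_log_prob, ["set an alarm", "sure, when?"])
without_context = rescore(n_best, toy_lm_log_prob, [])
print(with_context, "|", without_context)
```

With the previous turn available the rescorer picks "eight am"; with no context it falls back to the acoustically preferred "atm". Note that this same mechanism is what produces the fluent-but-wrong failure mode: the stronger the LM weight, the more the output looks plausible regardless of what was actually said.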