


	by jgehring 2671 days ago
	When training speech recognition systems, you want to use data that closely matches your target domain. Models trained on audiobooks read by professionals will not perform very well at transcribing conversational or spontaneous speech, or when there is background noise.


abakker 2671 days ago

But, if I understand correctly, systems can be trained separately on "this is background noise" and then apply those filters first, and then work with cleaned audio, right? I've been using krisp.ai for a few weeks and it has been fantastic at doing exactly that in real-time.

Regarding conversational speech, I get that. Books are definitely not conversational.

I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>


jgehring 2671 days ago

You're right, these issues can also be tackled independently. Transfer learning can help, but my first guess would be that it's hard to get reasonable accuracy (= usable for applications) without hundreds of hours of conversational data. You could also attempt to directly modify the audiobook data by adding noise or applying other distortions.
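As a rough illustration of the noise-adding idea, here is a minimal sketch that mixes a noise recording into a clean utterance at a chosen signal-to-noise ratio. It assumes both signals are mono sample arrays at the same sample rate; the function name `mix_at_snr` and its parameters are hypothetical, and real augmentation pipelines typically also add reverberation, speed changes and similar distortions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB).

    Hypothetical helper for illustration; assumes mono signals at one sample rate.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the speech, then take a random crop.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        return speech.copy()  # nothing meaningful to scale against

    # SNR_dB = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise
```

Calling this with several noise recordings and a range of `snr_db` values (say 0 to 20 dB) would turn each clean audiobook utterance into multiple noisy training examples.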

In any case, for read speech in particular there are several corpora out there already, including the moderately large LibriSpeech corpus (about 1000 hours). The state-of-the-art accuracy on read speech is also very good -- for example, domain-specific dictation systems have been commercially viable for quite some time. So while it's true that audiobooks are a large untapped source, I think that there are other large-scale and richer options like YouTube or movies (i.e. videos with speech for which subtitles are available) that would be more useful for making progress towards good speech recognition systems.


anoncake 2670 days ago

> videos with speech for which subtitles are available

The subtitles often don't match what is spoken exactly.


abakker 2671 days ago

Self reply with more questions/thoughts. Based on what I know, it seems like the problem could break down as:

1. We have a lot of training data for the voices of white men reading stuff.
2. We have good models that already exist for removing background noise.
3. We might be able to build good models that could identify accents, gender, age variation.
4. We have good models for style transfer that work in the audio domain.

Could we take an audiobook read by a white guy, use a style transfer model to give him a German accent, and then feed the German-accented version back into the speech recognition model as training data? Could you use a reverse style transfer model to turn accented audio into non-accented audio (i.e. normalize it all to the place where we have the most training data)? Could we use a combination of style transfer models to vastly expand the training data set, and then train the conversational systems?

Or are the style transfer models not good enough? Or do we not have training data for style transfers to turn the voices of white men into the voices of white men with German accents?

I don't want to trivialize the problem, but I'm genuinely curious: how are professionals actually trying to solve this now?


anoncake 2671 days ago

> I guess the next question though, would be: is the objective to build a model that understands all words, or conversational speech? <novice> It seems like transfer learning on a model trained on audiobooks and then conversations would still be a good path, right? </novice>

Understanding all words is not the problem. I don't know if it's universal, but frequently, a speech-to-text model is actually two models: a voice model (usually called an acoustic model, mapping raw audio to phonemes) and a language model (which models what the language looks like, i.e. which words exist and which sentences are likely). So if you want the STT system to understand novels, include novels in the training data for the language model. You can then combine it with a voice model suitable for conversational speech/the user's accent/background noise.
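To make the two-model split concrete, here is a toy sketch of how the scores might be combined: the voice model proposes candidate transcriptions with log-probabilities, and a language model trained on in-domain text re-ranks them. Everything here is an assumption for illustration: the `UnigramLM` class, the `rescore` function, the `lm_weight` parameter and the acoustic scores are made up, and real systems use far stronger language models and combine the scores during decoding rather than afterwards.

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram language model (stand-in for a real LM)."""

    def __init__(self, corpus_sentences):
        self.counts = Counter(
            word for sentence in corpus_sentences for word in sentence.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen words

    def log_prob(self, sentence):
        # Sum of smoothed per-word log-probabilities.
        return sum(
            math.log((self.counts[word] + 1) / (self.total + self.vocab_size))
            for word in sentence.lower().split()
        )

def rescore(candidates, lm, lm_weight=1.0):
    """Return the text maximizing acoustic log-prob + lm_weight * LM log-prob.

    `candidates` is a list of (text, acoustic_log_prob) pairs.
    """
    best = max(candidates, key=lambda c: c[1] + lm_weight * lm.log_prob(c[0]))
    return best[0]

# Language model trained on (made-up) novel text; acoustic scores are invented.
novel_lm = UnigramLM(["the knight drew his sword", "the knight rode to the castle"])
candidates = [("the night rode on", -4.0), ("the knight rode on", -4.2)]
print(rescore(candidates, novel_lm, lm_weight=1.0))
```

With the language model switched off (`lm_weight=0.0`) the acoustically preferred "night" wins; with it on, the novel-trained model tips the choice to "knight", which is the kind of domain adaptation described above.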


novaRom 2671 days ago

Transfer learning is not guaranteed to work well. Most learned features, even in the first layers, usually look very different when the model was trained in a clean environment. Background noise is not just a simple stationary signal; it covers very different audio patterns, such as music or other voices.
