by abakker
2671 days ago
Self-reply with more questions and thoughts. Based on what I know, it seems like the problem could break down as:

1. We have a lot of training data for the voices of white men reading stuff.
2. We have good models that already exist for removing background noise.
3. We might be able to build good models that could identify accent, gender, and age variation.
4. We have good models for style transfer that work in the audio domain.

Could we take an audiobook read by a white guy, use a style transfer model to give him a German accent, and then feed the German-accented version back into the speech recognition model as training data? Could you use a reverse style transfer model to turn accented audio into non-accented audio (i.e. normalize it all to the place where we have the most training data)? Could we use a combination of style transfer models to vastly expand the training data set, and then train the conversational systems on it?

Or are the style transfer models not good enough? Or do we not have the training data needed for a style transfer model to turn the voices of white men into the voices of white men with German accents? I don't want to trivialize the problem, but I'm genuinely curious: how are professionals actually trying to solve this now?
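
To make the augmentation idea concrete, here is a rough Python sketch of what I have in mind. Everything in it is hypothetical: `Utterance`, `augment_with_accents`, and `fake_german_transfer` are names I made up, and the "transfer" is just a placeholder that rescales samples, not a real style transfer model. The only point is that the transcript stays fixed while the audio gets restyled, so each transfer model multiplies the labeled data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Waveform = List[float]  # stand-in for real audio samples
AccentTransfer = Callable[[Waveform], Waveform]


@dataclass
class Utterance:
    audio: Waveform
    transcript: str
    accent: str


def augment_with_accents(
    corpus: List[Utterance],
    transfers: Dict[str, AccentTransfer],
) -> List[Utterance]:
    """Return the original corpus plus one restyled copy per accent.

    The transcript is reused unchanged, which assumes the transfer
    model alters how the words sound but not which words are said.
    """
    augmented = list(corpus)
    for utt in corpus:
        for accent, transfer in transfers.items():
            augmented.append(
                Utterance(transfer(utt.audio), utt.transcript, accent)
            )
    return augmented


def fake_german_transfer(audio: Waveform) -> Waveform:
    # Placeholder only: a real model would go here. This just halves
    # the samples so the output differs from the input.
    return [0.5 * sample for sample in audio]


corpus = [Utterance([0.1, -0.2, 0.3], "hello world", "source")]
augmented = augment_with_accents(corpus, {"german": fake_german_transfer})
```

The "reverse" idea would be the same shape run at inference time: pass incoming audio through a normalizing transfer before the recognizer sees it, instead of expanding the training set.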