|
Hey, speech ML researcher here. Make sure you get recordings from a variety of contexts. fifteen.ai's best TTS voices use ~90 min of utterances, some separated by emotion. If you're having her read a text, make sure it's engaging--we do a lot of unconscious voicing when reading aloud. Tbh, if she has a non-Anglophone accent, you're going to need more recordings because the training data is biased towards UK/US speakers. If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf
Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes. There are a few open-source implementations, but most are outdated--the better ones are kept private for business or privacy reasons. There's a lot of work on non-parallel transfer learning (i.e. the speakers are saying different things), so TTS has progressed rapidly and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi. Edit: Important note! Utterances are usually clipped of silence before/after, so take that into account when analyzing corpus lengths. The quality of each utterance is much, much more important than its length--fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data. |
But obviously attend to the human side of things as well, e.g. spend time with her.
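PS: to make the silence-clipping point concrete, here's a rough sketch of how you could estimate your "real" corpus length after trimming. This is not what fifteen.ai or any particular toolkit does; it's a simple energy-threshold trim in plain Python, and the function names, the 20 ms frame size, and the -40 dB threshold are all assumptions I picked for illustration. It expects mono samples as floats in [-1.0, 1.0].

```python
import math


def trim_silence(samples, sample_rate, frame_ms=20, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS level is below threshold_db.

    samples: sequence of mono floats in [-1.0, 1.0].
    threshold_db is relative to full scale (assumed value, tune per recording).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10 ** (threshold_db / 20)

    def is_voiced(start):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        return rms >= threshold

    voiced = [s for s in range(0, len(samples), frame_len) if is_voiced(s)]
    if not voiced:
        return samples[:0]  # nothing above the threshold
    # Keep everything from the first voiced frame through the last one,
    # so pauses inside the utterance are preserved.
    return samples[voiced[0]:voiced[-1] + frame_len]


def corpus_minutes(utterances, sample_rate):
    """Total corpus duration in minutes after trimming each utterance."""
    total = sum(len(trim_silence(u, sample_rate)) for u in utterances)
    return total / sample_rate / 60
```

Run `corpus_minutes` over your decoded clips and compare it to the raw total; if a big chunk of your "90 minutes" turns out to be dead air, you have less usable data than you think. A fixed threshold will misfire on noisy recordings, so listen to a few trimmed clips before trusting the number.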