by metildaa
2766 days ago
Building that 5000+ hour dataset needed to train quality Speech to Text is a serious challenge, and presumably TTS has a similar threshold of audio needed. IMO that is why it is critical to spread the word about Common Voice (a CC0 licensed voice corpus) and get a large variety of people contributing to it: https://voice.mozilla.org
This all gets worse if you want multi-speaker output of course - getting even a few speakers who all read the same large corpus is difficult. The two datasets I've gotten the most out of so far are "LJSpeech" (a subset of the LibriVox corpus), and the "Nancy Corpus / Blizzard 2013" dataset [0][1].
There's a pretty interesting corpus here that I hope to start using soon [2].
To me, the biggest issue / gap between commercial interests and publicly available data is curation: TTS really hinges on well-curated, clean data, at least for now. And if that dataset has a very balanced coverage of triphones, that's even better.
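For anyone wanting to check the triphone point on their own data, here is a rough sketch, not something from any particular toolkit. It assumes each utterance has already been converted to a list of phone symbols (the ARPAbet-style symbols and the names `triphones` and `triphone_stats` are just for illustration), and it uses normalized entropy as one possible stand-in for "balanced":

```python
import math
from collections import Counter


def triphones(phones):
    """Return every run of three consecutive phones in one utterance."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_stats(utterances):
    """Count triphones over a corpus and score how evenly they are spread.

    `utterances` is a list of phone lists (assumed to come from some
    grapheme-to-phoneme step that is not shown here).
    `balance` is entropy divided by its maximum: 1.0 means every observed
    triphone occurs equally often, values near 0 mean a few dominate.
    """
    counts = Counter(t for u in utterances for t in triphones(u))
    total = sum(counts.values())
    if total == 0:
        return {"distinct": 0, "total": 0, "balance": 0.0}
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 0.0
    balance = entropy / max_entropy if max_entropy > 0 else 1.0
    return {"distinct": len(counts), "total": total, "balance": balance}


if __name__ == "__main__":
    corpus = [
        ["HH", "AH", "L", "OW"],
        ["W", "ER", "L", "D"],
    ]
    print(triphone_stats(corpus))
```

Note this only measures balance among triphones that actually appear; checking coverage against the full set of triphones possible in a language would need a phone inventory as well.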
I'd like to try the voice.mozilla data, but given current struggles with even 1 speaker, a truly "in the wild" set of many speakers seems pretty difficult if training from scratch. For voice cloning using pretrained weights it may be a different story.
[0] https://keithito.com/LJ-Speech-Dataset/
[1] https://www.synsig.org/index.php/Blizzard_Challenge_2013
[2] http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/