
TTS potentially needs much less audio than ASR, but the spread and quality of the TTS dataset are critical, which is one reason why just training on ASR datasets "in reverse" hasn't worked well. For example, commercial databases run ~25 to 50 hours, but their "coverage" of the language is usually very different from, e.g., audiobooks: they focus specifically on covering edge cases of the language. You can think of it like a 25 hour "support set" that covers as many cases as possible, and that can also grow over time as users run into cases where the system fails.
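To make the "support set" idea concrete, here is a rough sketch of greedy coverage selection: repeatedly pick the sentence that adds the most not-yet-covered units. This is only an illustration under my own assumptions, not how any commercial vendor actually builds their scripts; `greedy_select`, the sentence ids, and the "units" (which could be phones, triphones, number formats, whatever you care about) are all hypothetical.

```python
def greedy_select(candidates, max_sentences):
    """Greedily pick sentences that add the most uncovered units.

    candidates: dict mapping a sentence id to the set of units it covers
                (the units are an assumption; use whatever you want covered).
    Returns the chosen ids in pick order and the set of covered units.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_sentences:
        # Sentence with the largest number of units we have not seen yet.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy example with made-up units.
candidates = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
chosen, covered = greedy_select(candidates, max_sentences=5)
```

When users hit a failure case, you would add sentences exercising that case to `candidates` and rerun, which is the "grow over time" part.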

This all gets worse if you want multi-speaker output, of course: getting even a few speakers who all read the same large corpus is difficult. The two datasets I've gotten the most out of so far are "LJSpeech" (a single speaker's recordings drawn from LibriVox) and the "Nancy Corpus / Blizzard 2013" dataset [0][1].

There's a pretty interesting corpus here that I hope to start using soon [2].

To me, the biggest issue / gap between commercial interests and publicly available data is curation: TTS really hinges on well-curated, clean data, at least for now. And if that dataset has a very balanced coverage of triphones, that's even better.
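If you want a quick number for "balanced coverage of triphones", something like the sketch below is one way to look at it. It assumes you already have each utterance as a list of phone symbols (from a lexicon or G2P tool of your choice, which is not shown), and it uses normalized entropy as the balance measure, which is my choice for illustration and not a standard anyone has agreed on. The function names are made up.

```python
from collections import Counter
import math


def triphones(phones):
    """Return the overlapping phone triples in one utterance."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_stats(utterances):
    """Count triphones over a corpus and score how evenly they are used.

    utterances: iterable of phone lists, one list per utterance.
    Returns (counts, balance), where balance is the entropy of the triphone
    distribution divided by its maximum, so 1.0 means perfectly even usage
    of the triphones that were observed.
    """
    counts = Counter()
    for phones in utterances:
        counts.update(triphones(phones))
    total = sum(counts.values())
    if len(counts) > 1:
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        balance = entropy / math.log(len(counts))
    else:
        balance = 0.0  # zero or one triphone type: the score is not meaningful
    return counts, balance


# Toy example with ARPAbet-style symbols.
corpus = [["HH", "AH", "L", "OW"], ["HH", "AH", "L", "OW"]]
counts, balance = triphone_stats(corpus)
```

Note that this only scores triphones that actually occur. To measure coverage against the full inventory of a language, you would need a reference list of valid triphones to compare `counts` against.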

I'd like to try the voice.mozilla data, but given current struggles with even 1 speaker, a truly "in the wild" set of many speakers seems pretty difficult if training from scratch. For voice cloning using pretrained weights it may be a different story.

[0] https://keithito.com/LJ-Speech-Dataset/

[1] https://www.synsig.org/index.php/Blizzard_Challenge_2013

[2] http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/