by kastnerkyle 2766 days ago
Worth noting that a big chunk of the core TTS code here is built on tools from other researchers like Ryuichi Yamamoto and Keith Ito [0], and they have great implementations to check out as well.

The best quality I have heard in OSS is probably [1] from Ryuichi, using Rayhane Mamah's Tacotron 2 implementation [2], which is loosely what NVIDIA based some of their recent baseline code on as well [3][4].

There's also a Colab notebook for this stuff, so you can try it directly without any pain: https://colab.research.google.com/github/r9y9/Colaboratory/b...

I also have my own pipeline for this (using some utilities from the above authors + a lot of my own hacks), for a forthcoming paper release, here: https://github.com/kastnerkyle/representation_mixing/tree/ma... (see the minimal demo). It has pretty fast sampling, but the audio quality is not as high as WaveNet. I'd really like to tie in with WaveGlow, but that's a work in progress for me so far.

NOTE: None of these have voice adaptivity per se, but given a model that already trains well + a multispeaker dataset with speaker IDs such as VCTK, a lot of things become possible. Getting a baseline model and data pipeline working for TTS is the difficult part.

[0] https://github.com/keithito/tacotron

[1] https://r9y9.github.io/blog/2018/05/20/tacotron2/

[2] https://github.com/Rayhane-mamah/Tacotron-2

[3] https://github.com/NVIDIA/waveglow

[4] https://github.com/NVIDIA/tacotron2


Building that 5000+ hour dataset needed to train quality Speech to Text is a serious challenge, and presumably TTS has a similar threshold of audio needed.

IMO that is why it is critical to spread the word about Common Voice (a CC0 licensed voice corpus) and get a large variety of people contributing to it: https://voice.mozilla.org

Much less audio is potentially needed for TTS than for ASR. However, the spread and quality of the TTS dataset is critical, which is one reason why just training on ASR datasets "in reverse" hasn't worked well. For example, commercial databases run ~25 to 50 hours, but their "coverage" of the language is usually very different from e.g. audiobooks, and focuses specifically on covering edge cases of the language. You can think of it like a 25 hour "support set" which covers as many cases as possible, and can also grow over time as users run into cases where the system fails.
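To make the "support set" idea concrete, here is a rough sketch of greedy script selection: keep picking the sentence that adds the most not-yet-covered units. This is only an illustration, not how any commercial corpus is actually built. The names `greedy_select` and `char_trigrams` are made up for this example, and character trigrams are a crude stand-in for real phonetic units.

```python
def greedy_select(candidates, units_of, budget):
    """Pick up to `budget` sentences, each time taking the one
    that adds the most units not covered so far."""
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        # Sentence with the largest number of new units (first one wins ties).
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:
            break  # nothing left adds coverage
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


def char_trigrams(sentence):
    # Assumption: character trigrams stand in for phonetic units here.
    text = sentence.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


sentences = ["the cat sat", "the cat sat down", "a dog barked", "the cat"]
chosen, covered = greedy_select(sentences, char_trigrams, budget=2)
print(chosen)
```

When the system fails on a new case, you can add sentences containing it to the candidate pool and rerun the selection, which is one way the set could grow over time.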

This all gets worse if you want multi-speaker output of course - getting even a few speakers who all read the same large corpus is difficult. The two datasets I've gotten the most out of so far are "LJSpeech" (a subset of the LibriVox corpus), and the "Nancy Corpus / Blizzard 2013" dataset [0][1].

There's a pretty interesting corpus here that I hope to start using soon [2].

To me, the biggest issue / gap between commercial interests and publicly available data is curation - TTS really hinges on well curated, clean data, at least for now. And if that dataset has a very balanced coverage of triphones, that's even better.
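As a rough sketch of what checking triphone coverage could look like, the snippet below counts triphones over phoneme sequences and reports a simple balance score (entropy normalized to the 0..1 range, where 1 means every observed triphone is equally frequent). The assumptions are mine: the phoneme sequences come from some external lexicon or G2P tool, utterances are padded with a hypothetical "sil" symbol, and `triphones` / `coverage_report` are names invented for this example.

```python
import math
from collections import Counter


def triphones(phones):
    # Assumption: pad with a silence symbol so utterance edges get context.
    padded = ["sil"] + list(phones) + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def coverage_report(utterances):
    """Count triphones over all utterances and score how evenly they occur."""
    counts = Counter(t for phones in utterances for t in triphones(phones))
    total = sum(counts.values())
    if len(counts) < 2:
        # Zero or one distinct triphone: nothing to balance against.
        balance = 1.0 if counts else 0.0
    else:
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        balance = entropy / math.log2(len(counts))
    return {"distinct": len(counts), "total": total, "balance": balance}


# Toy phoneme sequences (ARPAbet-style labels, purely illustrative).
utterances = [["k", "ae", "t"], ["k", "ae", "t"], ["d", "ao", "g"]]
report = coverage_report(utterances)
print(report)
```

Note that this only measures balance among triphones that actually occur. Checking coverage against the full inventory of a language would need a reference triphone list, which is not shown here.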

I'd like to try on the voice.mozilla data, but given current struggles on even 1 speaker, a truly "in the wild" set of many speakers seems pretty difficult if training from scratch. For voice cloning using pretrained weights it may be a different story.

[0] https://keithito.com/LJ-Speech-Dataset/

[1] https://www.synsig.org/index.php/Blizzard_Challenge_2013

[2] http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/